Machine Learning
Learning Styles
Supervised learning
SVM
no kernel (linear kernel)
Cost function
\( C\sum_{i=1}^m{\left[y^{(i)}cost_1(\theta^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta^Tx^{(i)})\right]} + \frac{1}{2}\sum_{j=1}^n{\theta_j^2} \)
kernel
similarity functions
Polynomial
\( k(x,l) = {(x^Tl + a)}^b \)
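The polynomial similarity function above can be sketched as follows (a minimal sketch; the function name and the hyperparameter defaults `a`, `b` are illustrative assumptions):

```python
import numpy as np

def polynomial_kernel(x, l, a=1.0, b=2):
    """Polynomial similarity k(x, l) = (x^T l + a)^b between an
    example x and a landmark l; a and b are hyperparameters."""
    return (np.dot(x, l) + a) ** b

x = np.array([1.0, 2.0])
l = np.array([0.5, -1.0])
print(polynomial_kernel(x, l))  # (0.5 - 2 + 1)^2 = 0.25
```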
cost function
\( C\sum_{i=1}^m{\left[y^{(i)}cost_1(\theta^Tf^{(i)}) + (1-y^{(i)})cost_0(\theta^Tf^{(i)})\right]} + \frac{1}{2}\sum_{j=1}^m{\theta_j^2} \)
practical
usage
No. of features small, No. of examples intermediate
No. of features small, No. of examples large
Regression Algorithms
overfitting
Regularization
optimization
gradient descent:
\( \theta_0 := \theta_0 - \alpha \frac{1}{m} X_0^T(h_\theta -y) \)
\( \theta_j := \theta_j - \alpha\left[ \frac{1}{m} X_j^T(h_\theta-y) + \frac{\lambda}{m}\theta_j \right] \quad j \in \{ 1,2 \dots n\} \)
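One regularized gradient-descent step, matching the two updates above (a sketch with numpy; variable names are assumptions, and \( \theta_0 \) is deliberately left unregularized):

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent step for linear regression.
    X: (m, n+1) design matrix whose first column is all ones."""
    m = X.shape[0]
    h = X @ theta                   # hypothesis h_theta for all examples
    grad = (X.T @ (h - y)) / m      # (1/m) * X^T (h - y)
    reg = (lam / m) * theta
    reg[0] = 0.0                    # do not regularize theta_0
    return theta - alpha * (grad + reg)

# toy usage: fit y = 1 + x
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_step(theta, X, y, alpha=0.1, lam=0.0)
print(theta)  # approaches [1, 1]
```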
linear regression
normal equation: \( \theta = (X^TX + \lambda L)^{-1}X^Ty \)
\(L = \begin{pmatrix}
0 & 0 & \dots & 0 \\
0 & 1 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & 1 \\
\end{pmatrix} \)
Unsupervised learning
Clustering
algorithm
K-means
procedure
randomly initialize centroids \( \mu_1, \mu_2, \dots \)
cluster assignment: assign each example to its nearest centroid
move centroids: \( \mu_k = \) mean of the points assigned to cluster k; loop back
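The K-means loop above can be sketched as follows (a minimal version assuming numpy; names and the fixed iteration count are illustrative):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """K-means: random centroid init, then alternate the cluster
    assignment step and the centroid (mean) update step."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(iters):
        # cluster assignment: nearest centroid for each point
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = d.argmin(axis=1)
        # move step: mu_k = mean of points assigned to cluster k
        for j in range(k):
            if np.any(c == j):
                mu[j] = X[c == j].mean(axis=0)
    return mu, c

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
mu, c = kmeans(X, k=2)
print(sorted(mu.tolist()))  # centroids near [0.05, 0] and [5.05, 5]
```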
Anomaly Detection
evaluation
metrics
true positives, false positives, true negatives, false negatives
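From those four counts the usual evaluation metrics follow; a small sketch using the standard definitions of precision, recall, and F1 (the example counts are made up):

```python
def metrics(tp, fp, tn, fn):
    """Precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = metrics(tp=8, fp=2, tn=85, fn=5)
print(p, r, f1)  # 0.8, ~0.615, ~0.696
```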
Model
learning algorithm creates
parameters \( \theta_0, \theta_1, \dots \) that minimize the
cost function \( J(\theta_0, \theta_1, \dots) \)
Large scale
online learning
algorithm
repeat forever
get \( (x,y) \) from the user
update \( \theta \) using \( (x,y) \)
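The online-learning loop above can be sketched with a logistic-regression SGD update on each arriving example (a sketch; the simulated stream, learning rate, and names are assumptions):

```python
import numpy as np

def online_update(theta, x, y, alpha=0.1):
    """One online SGD step on the logistic-regression loss for a
    single example (x, y); the example is then discarded."""
    h = 1.0 / (1.0 + np.exp(-np.dot(theta, x)))   # sigmoid hypothesis
    return theta - alpha * (h - y) * x

theta = np.zeros(2)
# simulated user stream: label is 1 exactly when the second feature is positive
stream = [(np.array([1.0, 2.0]), 1), (np.array([1.0, -2.0]), 0)] * 200
for x, y in stream:
    theta = online_update(theta, x, y)
print(theta)  # the second weight grows positive
```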
Application
Recommender Systems
collaborative filtering
algorithm
initialize \( x, \theta \) to small random values
simultaneously optimize \( \theta, x \)
cost function
\( J = \frac{1}{2}\sum_{(i,j): r(i,j)=1}{((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 } + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}{(x_k^{(i)})^2} + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n}{(\theta_k^{(j)})^2} \)
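The collaborative-filtering objective can be evaluated in vectorized form (a sketch including the conventional ½ factor on the squared error; rows of `X` are item features \( x^{(i)} \), rows of `Theta` are user parameters \( \theta^{(j)} \), and `R` marks which entries of `Y` are rated):

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Squared error over rated (i, j) pairs plus regularization on
    both the item features X and the user parameters Theta."""
    err = (X @ Theta.T - Y) * R      # zero out unrated entries
    return (0.5 * np.sum(err ** 2)
            + 0.5 * lam * np.sum(X ** 2)
            + 0.5 * lam * np.sum(Theta ** 2))

# toy check: 2 movies, 2 users, every entry rated, perfect fit
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Theta = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
R = np.ones((2, 2))
print(cofi_cost(X, Theta, Y, R, lam=0.0))  # 0.0
```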
Practice
Diagnostic
High bias (underfitting)
High Variance (overfitting)
tool
learning curve
\( J_{train}(N_{training\_examples}) , J_{CV}(N_{training\_examples}) \)
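The learning-curve diagnostic above can be sketched as follows: fit on the first N training examples, then record training and cross-validation error as N grows (a sketch; the regularized normal equation stands in for any learner, and all names are illustrative):

```python
import numpy as np

def learning_curve(X, y, Xcv, ycv, lam=0.0):
    """J_train(N) and J_CV(N): fit on the first N training examples,
    then measure error on those N and on the fixed CV set."""
    def fit(A, b):                     # regularized normal equation
        n = A.shape[1]
        L = np.eye(n)
        L[0, 0] = 0.0                  # do not regularize theta_0
        return np.linalg.solve(A.T @ A + lam * L, A.T @ b)

    def cost(theta, A, b):             # unregularized squared error
        return np.mean((A @ theta - b) ** 2) / 2.0

    j_train, j_cv = [], []
    for N in range(2, len(X) + 1):
        theta = fit(X[:N], y[:N])
        j_train.append(cost(theta, X[:N], y[:N]))
        j_cv.append(cost(theta, Xcv, ycv))
    return j_train, j_cv

# toy data that is exactly linear, so both curves stay near zero
X = np.c_[np.ones(6), np.arange(6.0)]
y = 1.0 + np.arange(6.0)
j_train, j_cv = learning_curve(X, y, X, y)
print(j_cv[-1])  # ~0 for perfectly linear data
```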