Math for Machine Learning
Vectors
:arrow_right:
Geometry of Column Vectors.
Operations
Vectors as Directions
Scalar Multiplication
Addition as Displacement
Subtraction as Mapping
Measures of Magnitude
Geometry of Norms
Types of Norms
Lp-Norm
L1-Norm
Euclidean Norm (L2-Norm)
L∞-Norm
L0-Norm
Despite the name, it's not a norm
It counts the number of non-zero elements of a vector
Norms - are measures of distance
Norm Properties
All distances are non-negative
Distances scale with scalar multiplication: ‖c·v‖ = |c|·‖v‖
If I travel from A to B, then B to C, that is at least as far as going from A to C (Triangle Inequality)
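A minimal sketch of these norms and properties, assuming NumPy and made-up example vectors:

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])
w = np.array([1.0, 2.0, -2.0])

l1   = np.linalg.norm(v, 1)        # L1-norm: sum of absolute values -> 7.0
l2   = np.linalg.norm(v, 2)        # Euclidean (L2) norm: sqrt of sum of squares -> 5.0
linf = np.linalg.norm(v, np.inf)   # L-infinity norm: largest absolute value -> 4.0
l0   = np.count_nonzero(v)         # "L0": number of non-zero entries (not a true norm) -> 2

# Norm properties: scaling with scalar multiplication and the triangle inequality
assert np.isclose(np.linalg.norm(2 * v), 2 * np.linalg.norm(v))
assert np.linalg.norm(v + w) <= np.linalg.norm(v) + np.linalg.norm(w)
```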
Matrices
:bookmark_tabs:
Matrix multiplication and examples
If A is a matrix where the rows are features w_i and B is a matrix where the columns are data vectors v_j, then the (i, j)-th entry of the product is w_i·v_j, which is to say the i-th feature of the j-th vector.
In formulae: if C = AB, where A is an n x m matrix and B is an m x k matrix, then C is an n x k matrix where C_ij = Σ_l A_il·B_lj
Well defined only when the number of columns of A equals the number of rows of B
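A quick check of the formula, assuming NumPy and small made-up matrices (rows of A as features, columns of B as data vectors):

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],    # 2 x 3: each row is a feature vector w_i
              [0.0, 1.0, 3.0]])
B = np.array([[1.0, 4.0],         # 3 x 2: each column is a data vector v_j
              [2.0, 5.0],
              [0.0, 6.0]])

C = A @ B                          # 2 x 2 product: C[i, j] = sum_l A[i, l] * B[l, j]

# The (i, j) entry is the dot product of row i of A with column j of B
i, j = 0, 1
assert np.isclose(C[i, j], A[i, :] @ B[:, j])
```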
Dot product and how to extract angles
Key Consequences
Orthogonality
v·w = 0
v·w > 0: angle less than 90°; v·w < 0: angle greater than 90°
Angles
Hyperplane - the set of points orthogonal to a given vector
a perpendicular line in 2D
a perpendicular plane in 3D
Decision plane
1D
2D
3D
Dot Product - the sum of the products of the corresponding entries of the two sequences of numbers
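A small sketch, assuming NumPy, of extracting an angle from the dot product via cos θ = v·w / (‖v‖‖w‖):

```python
import numpy as np

v = np.array([1.0, 0.0])
w = np.array([1.0, 1.0])

dot = v @ w                                    # sum of products of corresponding entries
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))
theta = np.degrees(np.arccos(cos_theta))       # ~45 degrees

# Orthogonality: the dot product of perpendicular vectors is zero
assert np.isclose(np.array([1.0, 0.0]) @ np.array([0.0, 1.0]), 0.0)
```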
Matrix product properties
Matrix Products
Distributivity
A(B+C) = AB + AC
Associativity
A(BC)=(AB)C
Not commutative
AB ≠ BA in general
The Identity Matrix
IA=A
Ones on the diagonal, zeros elsewhere
Properties of the Hadamard Product
Distributivity
A∘(B+C) = A∘B + A∘C
Associativity
A∘(B∘C) = (A∘B)∘C
Commutativity
A∘B = B∘A
Linear dependence
det(A) = 0 if and only if the columns of A are linearly dependent
Definition
the vectors lie in a lower dimensional space
if there are some scalars a_i, not all zero, such that
a1·v1 + a2·v2 + ... + ak·vk = 0
Example
a1=1, a2=-2, a3=-1
Hadamard product
An (often less useful) method of multiplying matrices is element-wise
A∘B
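A sketch verifying these product properties numerically, assuming NumPy and random example matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.random((2, 2)) for _ in range(3))

# Matrix product: distributive and associative, but not commutative
assert np.allclose(A @ (B + C), A @ B + A @ C)
assert np.allclose(A @ (B @ C), (A @ B) @ C)
print(np.allclose(A @ B, B @ A))        # generally False

# Identity matrix: ones on the diagonal, zeros elsewhere
assert np.allclose(np.eye(2) @ A, A)

# Hadamard (element-wise) product: also commutative
assert np.allclose(A * B, B * A)
```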
Geometry of matrix operations
Intuition from Two Dimensions
Suppose A is a 2x2 matrix (mapping R² to itself). Any such matrix can be expressed uniquely as a stretching, followed by a skewing, followed by a rotation
Any vector can be written as a sum of scalar multiples of two specific vectors
A applied to any vector
The Determinant
det(A) is the factor the area is multiplied by
det(A) is negative if it flips the plane over
Determinant computation
The Two-by-two
det(A)=ad-bc
Larger Matrices
m determinants of (m-1)x(m-1) matrices
computers do it more simply in O(m³) time
using matrix factorizations
Matrix invertibility
When can you invert?
inversion can be done if and only if det(A) ≠ 0
How to Compute the Inverse
A^(-1)*A=I
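A minimal sketch of the determinant and inverse, assuming NumPy and a made-up 2x2 example:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# 2x2 determinant: ad - bc = 1*4 - 2*3 = -2
assert np.isclose(np.linalg.det(A), 1 * 4 - 2 * 3)

# Inversion works here because det(A) != 0
A_inv = np.linalg.inv(A)
assert np.allclose(A_inv @ A, np.eye(2))        # A^(-1) A = I

# A matrix with linearly dependent columns has det = 0 and no inverse
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])                      # second column = 2 * first column
print(np.isclose(np.linalg.det(S), 0.0))        # True
```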
Multivariate Derivatives
:silhouettes:
Convexity
Derivative Condition
Hf is called positive semi-definite if vᵀ·Hf·v ≥ 0 for all vectors v
and this implies f is convex
A Warning
When going to more complex models, e.g. neural networks, there are many local minima and many saddle points.
So they are not convex
Opposite is Concave
a function is convex if the line segment between any two points on its graph stays above the graph
Benefits
When a function is convex, there is a single unique local minimum
no maxima
no saddle points
Gradient descent
is guaranteed to find the global minimum with a small enough learning rate
Newton's Method always works
The Gradient
Definitions
Matrix derivative
Vector derivative
The Gradient
the collection of all partial derivatives, ∇f = (∂f/∂x₁, ..., ∂f/∂xₙ)
Gradient Descent
Level set
Visualize
Key Properties
∇f points in the direction of maximum increase
−∇f points in the direction of maximum decrease
∇f = 0 at local max & min
Second Derivative
Hessian
2D intuition
Hf = [[∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²]]
Critical points
∇f = 0
det(Hf) < 0
saddle point
det(Hf) > 0
further investigation
tr(Hf) > 0
local minimum
tr(Hf) < 0
local maximum
tr(Hf) = 0
does not happen
det(Hf) = 0
unclear
need more info
If the matrix is diagonal, a positive entry is a direction where it curves up, and a negative entry is a direction where it curves down
Trace
sum of diagonal terms tr(Hf)
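A sketch of the det/trace classification above, using a hypothetical function f(x, y) = x² + 3y² whose Hessian is constant (NumPy assumed):

```python
import numpy as np

# Hypothetical example: f(x, y) = x**2 + 3*y**2 has a critical point at (0, 0)
# Its Hessian is constant:
Hf = np.array([[2.0, 0.0],
               [0.0, 6.0]])

det, tr = np.linalg.det(Hf), np.trace(Hf)
if det < 0:
    kind = "saddle point"
elif det > 0:
    kind = "local minimum" if tr > 0 else "local maximum"
else:
    kind = "unclear - need more info"
print(kind)                                     # local minimum
```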
For a function of an n-dimensional vector, there are many derivatives
If you have a function f(x), where x is an n-dimensional vector,
then it's a function of many variables.
You need to know how the function responds to changes in all of them.
The majority of this will be just bookkeeping, but will be terribly messy bookkeeping.
Partial Derivatives
is a measure of the rate of change of the function... when one of the variables is subjected to a small change but the others are kept constant.
Example
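A minimal numerical sketch of partial derivatives, using a hypothetical function f(x, y) = x²y + y³ and forward differences:

```python
# Hypothetical function of two variables: f(x, y) = x**2 * y + y**3
f = lambda x, y: x**2 * y + y**3

def partial(f, x, y, wrt="x", eps=1e-6):
    """Rate of change when one variable gets a small change, the other held constant."""
    if wrt == "x":
        return (f(x + eps, y) - f(x, y)) / eps
    return (f(x, y + eps) - f(x, y)) / eps

# Analytically: df/dx = 2*x*y and df/dy = x**2 + 3*y**2
print(partial(f, 1.0, 2.0, "x"))   # ~4.0
print(partial(f, 1.0, 2.0, "y"))   # ~13.0
```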
Newton Method
The computational complexity of inverting an n x n matrix is not actually known, but the best-known algorithm is O(n^2.373)
For high dimensional data sets, anything past linear time in the dimensions is often impractical, so Newton's Method is reserved for a few hundred dimensions at most.
Matrix Calculus
Univariate Derivatives
:silhouette:
Newton's method
Idea
Minimizing f <---> f'(x) = 0
Now look for an algorithm to find the zero of some function g(x)
Apply this algorithm to f'(x)
Computing the Line
line through (x₀, g(x₀))
slope g'(x₀)
y = g'(x₀)(x − x₀) + g(x₀)
solve the equation y = 0
Relationship to Gradient Descent
The learning rate adapts to f(x): effectively η = 1/f''(x)
Update Step for Zero Finding
we want to find where g(x) = 0
we start with some initial guess x₀ and then iterate x_{n+1} = x_n − g(x_n)/g'(x_n)
Pictorially
plot of g(x) and the point x such that g(x) = 0
Update Step for Minimization
To minimize f, we want to find where f'(x) = 0, and thus we may start with some initial guess x₀ and then iterate Newton's Method on f' to get x_{n+1} = x_n − f'(x_n)/f''(x_n)
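A minimal sketch of Newton's Method for minimization, using a hypothetical quadratic f(x) = x² − 4x (minimum at x = 2):

```python
# f(x) = x**2 - 4*x, so f'(x) = 2*x - 4 and f''(x) = 2
def f_prime(x):
    return 2 * x - 4

def f_double_prime(x):
    return 2.0

x = 10.0                                    # initial guess x0
for _ in range(10):
    x = x - f_prime(x) / f_double_prime(x)  # Newton step applied to f'
print(x)                                    # converges to 2.0
```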
Gradient Descent
As simplistic as this is, almost all machine learning you have heard of uses some version of this in the learning process
Goal: Minimize f(x)
Issues:
how to pick eta
recall that an improperly chosen learning rate will cause the entire optimization procedure to either fail or operate too slowly to be of practical use.
Sometimes we can circumvent this issue.
ALGORITHM (see the sketch below)
1. Start with a guess x₀
2. Iterate x_{n+1} = x_n − η·∇f(x_n), where η is the learning rate
3. Stop after some condition is met
if the value of x doesn't change by more than 0.001
a fixed number of steps
fancier things TBD
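A minimal sketch of this algorithm, assuming a hypothetical objective f(x) = (x − 3)² and a hand-picked η:

```python
# Gradient of the made-up objective f(x) = (x - 3)**2
grad = lambda x: 2 * (x - 3)

eta = 0.1                        # learning rate
x = 0.0                          # 1. start with a guess x0
for _ in range(10_000):          # 2. iterate x <- x - eta * grad(x)
    x_new = x - eta * grad(x)
    if abs(x_new - x) < 0.001:   # 3. stop when x changes by less than 0.001
        x = x_new
        break
    x = x_new
print(x)                         # close to the minimizer x = 3
```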
Maximum Likelihood Estimation
find p such that P_p(D) is maximized
Second Derivative
f''(x)
shows how the slope is changing
max → f''(x) < 0
min → f''(x) > 0
Can't tell → f''(x) = 0
proceed with higher derivative
Derivative
Can be presented as: f'(x) = lim_{ε→0} [f(x + ε) − f(x)] / ε
Interpretation
let's approximate f near x
better approximation: f(x + ε) ≈ f(x) + f'(x)·ε
Rules
Chain Rule
Alternative
Product Rule
Sum Rule
Quotient Rule
Most usable
http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html
Probability
:game_die:
Axioms of probability
Something always happens.
The fraction of the times an event occurs is between 0 and 1.
If two events can't happen at the same time (disjoint events), then the fraction of the time that at least one of them occurs is the sum of the fractions of the time each one occurs separately.
Terminology
Outcome
A single possibility from the experiment
Sample Space
The set of all possible outcomes
Capital Omega
Event
Something you can observe with a yes/no answer
Capital E
Probability
Fraction of an experiment where an event occurs
P{E} ∈ [0,1]
Visualizing Probability
using a Venn diagram
Inclusion/Exclusion
Intersection
of two sets
Union
of two sets
Symmetric difference
of two sets
Relative complement
of A (left) in B (right)
Absolute complement
of A in U
General Picture
Sample Space <-> Region
Outcomes <-> Points
Events <-> Subregion
Disjoint events <-> Disjoint subregions
Probability <-> Area of subregion
Conditional probability
If I know B occurred, the probability that A occurred is the fraction of the area of B which is occupied by A: P(A|B) = P(A∩B) / P(B)
Intuition:
The probability of an event is the expected fraction of time that the outcome would occur with repeated experiments.
Building machine learning models
Maximum Likelihood Estimation
Given a probability model with some vector of parameters θ (Theta) and observed data D, the best fitting model is the one that maximizes the probability P_θ(D)
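A small sketch of maximum likelihood, assuming a hypothetical coin-flip model (10 flips, 7 heads) and a grid search over p (NumPy assumed):

```python
import numpy as np

# Hypothetical data D: 10 coin flips with 7 heads; model parameter p = P(heads)
heads, tails = 7, 3

def log_likelihood(p):
    return heads * np.log(p) + tails * np.log(1 - p)

ps = np.linspace(0.01, 0.99, 99)
p_hat = ps[np.argmax(log_likelihood(ps))]
print(p_hat)                 # ~0.7: the fraction of heads maximizes P_p(D)
```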
Bayes’ rule
can be leveraged to understand competing hypotheses
odds are a ratio of two probabilities, e.g. 2/1
Posterior odds = (ratio of the probabilities of generating the data under each hypothesis) × prior odds
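A tiny sketch of the odds update, with made-up prior odds and data probabilities for two hypotheses H1 and H2:

```python
# Hypothetical numbers for two competing hypotheses H1, H2
prior_odds = 2 / 1            # we think H1 is twice as likely as H2 a priori
p_data_given_h1 = 0.1         # probability of the observed data under H1
p_data_given_h2 = 0.4         # probability of the observed data under H2

likelihood_ratio = p_data_given_h1 / p_data_given_h2
posterior_odds = likelihood_ratio * prior_odds    # posterior odds = data ratio * prior odds
print(posterior_odds)                             # 0.5: the data shift belief toward H2
```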
Independence
Two events are independent if one event doesn't influence the other
A and B are independent if P{A∩B} = P{A}·P{B}
Chebyshev’s inequality
For any random variable X (no assumptions), P(|X − E[X]| ≥ k·σ) ≤ 1/k²
e.g., X falls within 10 standard deviations of its mean at least 99% of the time
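An empirical sketch of Chebyshev's inequality, assuming NumPy and an arbitrary (here exponential) example distribution:

```python
import numpy as np

# Empirical check of Chebyshev's inequality on a non-Gaussian distribution
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)
mu, sigma, k = x.mean(), x.std(), 10

within = np.mean(np.abs(x - mu) < k * sigma)
print(within)          # at least 1 - 1/k**2 = 0.99, regardless of the distribution
```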
The Gaussian curve
Key Properties
Central limit theorem
is a statistical theory stating that, given a sufficiently large sample size from a population with a finite level of variance, the mean of the samples will be approximately normally distributed around the population mean
Maximum entropy distribution
Amongst all continuous RVs with E[X] = 0 and Var[X] = 1, the entropy H(X) is maximized uniquely for X ~ N(0,1)
The Gaussian is the most random RV with fixed mean and variance
General Gaussian Density: f(x) = exp(−(x − μ)² / (2σ²)) / (σ·√(2π))
Standard Gaussian (Normal Distribution) Density: f(x) = exp(−x²/2) / √(2π)
E[X]=0
Var[X]=1
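A minimal sketch of the Gaussian density, assuming NumPy; the made-up grid checks that it integrates to about 1:

```python
import numpy as np

def gaussian_density(x, mu=0.0, sigma=1.0):
    """General Gaussian density; mu=0, sigma=1 gives the standard normal."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-10, 10, 10_001)
dx = xs[1] - xs[0]
print(np.sum(gaussian_density(xs)) * dx)        # ~1.0: total probability integrates to one
```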
Random variables
is a function X that takes in an outcome and gives a number back
A discrete X takes at most countably many values, usually only a finite set of values
Expected Value
mean
Variance
how far samples typically are from the mean: Var[X] = E[(X − E[X])²]
Standard Deviation: σ = √Var[X]
Entropy
Entropy (H)
The Only Choice was the Units
first you need to choose the base for the logarithm
If the base is not 2, then the entropy should be divided by log₂ of the base
Examples
One coin
Entropy = one bit of randomness
H (1/2)
T (1/2)
Two coins
Entropy = 2 bits of randomness
H
HH (1/4)
HT (1/4)
T
TH (1/4)
TT (1/4)
A mixed case
Entropy = 1.5 bits of randomness = 1/2·(1 bit) + 1/2·(2 bits)
H (1/2)
T
TH (1/4)
TT (1/4)
Examine the Trees
if we flip n coins, then P=1/2^n
# coin flips = -log2(P)
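A small sketch computing H = −Σ p·log₂(p) for the coin examples above (NumPy assumed):

```python
import numpy as np

def entropy_bits(probs):
    """H = -sum(p * log2(p)) over outcomes with non-zero probability."""
    p = np.array([q for q in probs if q > 0])
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([0.5, 0.5]))                 # one coin: 1 bit
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # two coins: 2 bits
print(entropy_bits([0.5, 0.25, 0.25]))          # the mixed case: 1.5 bits
```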
Continuous random variables
For many applications ML works with continuous random variables (measurements with real numbers).
Probability density function