Math for Machine Learning
Vectors
:arrow_right:
Geometry of Column Vectors.
Operations
Vectors as Directions
Scalar Multiplication
Addition as Displacement
Subtraction as Mapping
Measures of Magnitude
Geometry of Norms
Types of Norms
Lp-Norm
L1-Norm
Euclidean Norm (L2-Norm)
L∞-Norm
L0-Norm
Despite the name, it's not a norm
It counts the number of non-zero elements of a vector
Norms - are measures of distance
Norm Properties
All distances are non-negative
Distances scale with scalar multiplication: ‖c·v‖ = |c|·‖v‖
If I travel from A to B, then B to C, that is at least as far as going from A to C (Triangle Inequality)
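A minimal sketch of these norms and properties, assuming NumPy and made-up example vectors:

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])
w = np.array([1.0, 2.0, -2.0])

l1   = np.linalg.norm(v, 1)        # L1-norm: sum of absolute values -> 7.0
l2   = np.linalg.norm(v, 2)        # Euclidean (L2) norm: sqrt of sum of squares -> 5.0
linf = np.linalg.norm(v, np.inf)   # L-infinity norm: largest absolute value -> 4.0
l0   = np.count_nonzero(v)         # "L0": number of non-zero entries (not a true norm) -> 2

# Norm properties: scaling with scalar multiplication and the triangle inequality
assert np.isclose(np.linalg.norm(2 * v), 2 * np.linalg.norm(v))
assert np.linalg.norm(v + w) <= np.linalg.norm(v) + np.linalg.norm(w)
```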
Matrices
:bookmark_tabs:
Matrix multiplication and examples
If A is a matrix where the rows are features w_i and B is a matrix where the columns are data vectors v_j, then the (i, j)-th entry of the product is w_i·v_j, which is to say the i-th feature of the j-th vector.
In formulae: if C = AB, where A is an n x m matrix and B is an m x k matrix, then C is an n x k matrix where C_ij = Σ_l A_il·B_lj
Well defined only when the number of columns of A equals the number of rows of B
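A quick check of the formula, assuming NumPy and small made-up matrices (rows of A as features, columns of B as data vectors):

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],    # 2 x 3: each row is a feature vector w_i
              [0.0, 1.0, 3.0]])
B = np.array([[1.0, 4.0],         # 3 x 2: each column is a data vector v_j
              [2.0, 5.0],
              [0.0, 6.0]])

C = A @ B                          # 2 x 2 product: C[i, j] = sum_l A[i, l] * B[l, j]

# The (i, j) entry is the dot product of row i of A with column j of B
i, j = 0, 1
assert np.isclose(C[i, j], A[i, :] @ B[:, j])
```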
Dot product and how to extract angles
Key Consequences
Orthogonality
v·w = 0
v·w > 0: angle less than 90°; v·w < 0: angle greater than 90°
Angles
Hyperplane - the set of points orthogonal to a given vector
a perpendicular line in 2D
a perpendicular plane in 3D
Decision plane
1D
2D
3D
Dot Product - the sum of the products of the corresponding entries of the two sequences of numbers
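A small sketch, assuming NumPy, of extracting an angle from the dot product via cos θ = v·w / (‖v‖‖w‖):

```python
import numpy as np

v = np.array([1.0, 0.0])
w = np.array([1.0, 1.0])

dot = v @ w                                    # sum of products of corresponding entries
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))
theta = np.degrees(np.arccos(cos_theta))       # ~45 degrees

# Orthogonality: the dot product of perpendicular vectors is zero
assert np.isclose(np.array([1.0, 0.0]) @ np.array([0.0, 1.0]), 0.0)
```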
Matrix product properties
Matrix Products
Distributivity
A(B+C) = AB + AC
Associativity
A(BC)=(AB)C
Not commutative
AB ≠ BA in general
The Identity Matrix
IA=A
Ones on the diagonal, zeros elsewhere
Properties of the Hadamard Product
Distributivity
A∘(B+C) = A∘B + A∘C
Associativity
A∘(B∘C) = (A∘B)∘C
Commutativity
A∘B = B∘A
Linear dependence
det(A) = 0 if and only if the columns of A are linearly dependent
Definition
the vectors lie in a lower dimensional space
if there are some scalars a_i, not all zero, such that
a1·v1 + a2·v2 + ... + ak·vk = 0
Example
a1=1, a2=-2, a3=-1
Hadamard product
An (often less useful) method of multiplying matrices is element-wise
A∘B
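A sketch verifying these product properties numerically, assuming NumPy and random example matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.random((2, 2)) for _ in range(3))

# Matrix product: distributive and associative, but not commutative
assert np.allclose(A @ (B + C), A @ B + A @ C)
assert np.allclose(A @ (B @ C), (A @ B) @ C)
print(np.allclose(A @ B, B @ A))        # generally False

# Identity matrix: ones on the diagonal, zeros elsewhere
assert np.allclose(np.eye(2) @ A, A)

# Hadamard (element-wise) product: also commutative
assert np.allclose(A * B, B * A)
```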
Geometry of matrix operations
Intuition from Two Dimensions
Suppose A is a 2x2 matrix (mapping R² to itself). Any such matrix can be expressed uniquely as a stretching, followed by a skewing, followed by a rotation
Any vector can be written as a sum of scalar multiples of two specific vectors
A applied to any vector
The Determinant
det(A) is the factor the area is multiplied by
det(A) is negative if it flips the plane over
Determinant computation
The Two-by-two
det(A)=ad-bc
Larger Matrices
m determinants of (m-1)x(m-1) matrices
computers do it more simply in O(m³) time
using matrix factorizations
Matrix invertibility
When can you invert?
inversion can be done if and only if det(A) ≠ 0
How to Compute the Inverse
A^(-1)*A=I
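A minimal sketch of the determinant and inverse, assuming NumPy and a made-up 2x2 example:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# 2x2 determinant: ad - bc = 1*4 - 2*3 = -2
assert np.isclose(np.linalg.det(A), 1 * 4 - 2 * 3)

# Inversion works here because det(A) != 0
A_inv = np.linalg.inv(A)
assert np.allclose(A_inv @ A, np.eye(2))        # A^(-1) A = I

# A matrix with linearly dependent columns has det = 0 and no inverse
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])                      # second column = 2 * first column
print(np.isclose(np.linalg.det(S), 0.0))        # True
```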
Multivariate Derivatives
:silhouettes:
Convexity
Derivative Condition
Hf is called positive semi-definite if vᵀ·Hf·v ≥ 0 for all vectors v
and this implies f is convex
A Warning
When going to more complex models, e.g. neural networks, there are many local minima and many saddle points.
So they are not convex
Opposite is Concave
a function is convex if the line segment between any two points on its graph stays above the graph
Benefits
When a function is convex, there is a single unique local minimum
no maxima
no saddle points
Gradient descent
is guaranteed to find the global minimum with a small enough learning rate
Newton's Method always works
The Gradient
Definitions
Matrix derivative
Vector derivative
The Gradient
the collection of all partial derivatives, ∇f = (∂f/∂x₁, ..., ∂f/∂xₙ)
Gradient Descent
Level set
Visualize
Key Properties
∇f points in the direction of maximum increase
−∇f points in the direction of maximum decrease
∇f = 0 at local max & min
Second Derivative
Hessian
2D intuition
Hf = [[∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²]]
Critical points
∇f = 0
det(Hf) < 0
saddle point
det(Hf) > 0
further investigation
tr(Hf) > 0
local minimum
tr(Hf) < 0
local maximum
tr(Hf) = 0
does not happen
det(Hf) = 0
unclear
need more info
If the matrix is diagonal, a positive entry is a direction where it curves up, and a negative entry is a direction where it curves down
Trace
sum of diagonal terms tr(Hf)
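A sketch of the det/trace classification above, using a hypothetical function f(x, y) = x² + 3y² whose Hessian is constant (NumPy assumed):

```python
import numpy as np

# Hypothetical example: f(x, y) = x**2 + 3*y**2 has a critical point at (0, 0)
# Its Hessian is constant:
Hf = np.array([[2.0, 0.0],
               [0.0, 6.0]])

det, tr = np.linalg.det(Hf), np.trace(Hf)
if det < 0:
    kind = "saddle point"
elif det > 0:
    kind = "local minimum" if tr > 0 else "local maximum"
else:
    kind = "unclear - need more info"
print(kind)                                     # local minimum
```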
For a function of an n-dimensional vector, there are many derivatives
If you have a function f(x), where x is an n-dimensional vector,
then it's a function of many variables.
You need to know how the function responds to changes in all of them.
The majority of this will be just bookkeeping, but will be terribly messy bookkeeping.
Partial Derivatives
is a measure of the rate of change of the function... when one of the variables is subjected to a small change but the others are kept constant.
Example
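A minimal numerical sketch of partial derivatives, using a hypothetical function f(x, y) = x²y + y³ and forward differences:

```python
# Hypothetical function of two variables: f(x, y) = x**2 * y + y**3
f = lambda x, y: x**2 * y + y**3

def partial(f, x, y, wrt="x", eps=1e-6):
    """Rate of change when one variable gets a small change, the other held constant."""
    if wrt == "x":
        return (f(x + eps, y) - f(x, y)) / eps
    return (f(x, y + eps) - f(x, y)) / eps

# Analytically: df/dx = 2*x*y and df/dy = x**2 + 3*y**2
print(partial(f, 1.0, 2.0, "x"))   # ~4.0
print(partial(f, 1.0, 2.0, "y"))   # ~13.0
```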
Newton Method
The computational complexity of inverting an n x n matrix is not actually known, but the best-known algorithm is O(n^2.373)
For high dimensional data sets, anything past linear time in the dimensions is often impractical, so Newton's Method is reserved for a few hundred dimensions at most.
Matrix Calculus
Univariate Derivatives
:silhouette:
Newton's method
Idea
Minimizing f <---> f'(x) = 0
Now look for an algorithm to find the zero of some function g(x)
Apply this algorithm to f'(x)
Computing the Line
line through (x₀, g(x₀))
slope g'(x₀)
y = g'(x₀)(x − x₀) + g(x₀)
solve the equation y = 0
Relationship to Gradient Descent
The learning rate adapts to f(x): effectively η = 1/f''(x)
Update Step for Zero Finding
we want to find where g(x) = 0
we start with some initial guess x₀ and then iterate x_{n+1} = x_n − g(x_n)/g'(x_n)
Pictorially
plot of g(x) and the point x such that g(x) = 0
Update Step for Minimization
To minimize f, we want to find where f'(x) = 0, and thus we may start with some initial guess x₀ and then iterate Newton's Method on f' to get x_{n+1} = x_n − f'(x_n)/f''(x_n)
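A minimal sketch of Newton's Method for minimization, using a hypothetical quadratic f(x) = x² − 4x (minimum at x = 2):

```python
# f(x) = x**2 - 4*x, so f'(x) = 2*x - 4 and f''(x) = 2
def f_prime(x):
    return 2 * x - 4

def f_double_prime(x):
    return 2.0

x = 10.0                                    # initial guess x0
for _ in range(10):
    x = x - f_prime(x) / f_double_prime(x)  # Newton step applied to f'
print(x)                                    # converges to 2.0
```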
Gradient Descent
As simplistic as this is, almost all machine learning you have heard of uses some version of this in the learning process
Goal: Minimize f(x)
Issues:
how to pick eta
recall that an improperly chosen learning rate will cause the entire optimization procedure to either fail or operate too slowly to be of practical use.
Sometimes we can circumvent this issue.
ALGORITHM (see the sketch below)
1. Start with a guess x₀
2. Iterate x_{n+1} = x_n − η·∇f(x_n), where η is the learning rate
3. Stop after some condition is met
if the value of x doesn't change by more than 0.001
a fixed number of steps
fancier things TBD
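A minimal sketch of this algorithm, assuming a hypothetical objective f(x) = (x − 3)² and a hand-picked η:

```python
# Gradient of the made-up objective f(x) = (x - 3)**2
grad = lambda x: 2 * (x - 3)

eta = 0.1                        # learning rate
x = 0.0                          # 1. start with a guess x0
for _ in range(10_000):          # 2. iterate x <- x - eta * grad(x)
    x_new = x - eta * grad(x)
    if abs(x_new - x) < 0.001:   # 3. stop when x changes by less than 0.001
        x = x_new
        break
    x = x_new
print(x)                         # close to the minimizer x = 3
```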
Maximum Likelihood Estimation
find p such that P_p(D) is maximized
Second Derivative
f''(x)
shows how the slope is changing
max → f''(x) < 0
min → f''(x) > 0
Can't tell → f''(x) = 0
proceed with higher derivative
Derivative
Can be presented as: f'(x) = lim_{ε→0} [f(x + ε) − f(x)] / ε
Interpretation
let's approximate f near x
better approximation: f(x + ε) ≈ f(x) + f'(x)·ε
Rules
Chain Rule
Alternative
Product Rule
Sum Rule
Quotient Rule
Most usable
http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html
Probability
:game_die:
Axioms of probability
Something always happens.
The fraction of the times an event occurs is between 0 and 1.
If two events can't happen at the same time (disjoint events), then the fraction of the time that at least one of them occurs is the sum of the fractions of the time each one occurs separately.
Terminology
Outcome
A single possibility from the experiment
Sample Space
The set of all possible outcomes
Capital Omega
Event
Something you can observe with a yes/no answer
Capital E
Probability
Fraction of an experiment where an event occurs
P{E} ∈ [0,1]
Visualizing Probability
using a Venn diagram
Inclusion/Exclusion
Intersection
of two sets
Union
of two sets
Symmetric difference
of two sets
Relative complement
of A (left) in B (right)
Absolute complement
of A in U
General Picture
Sample Space <-> Region
Outcomes <-> Points
Events <-> Subregion
Disjoint events <-> Disjoint subregions
Probability <-> Area of subregion
Conditional probability
If I know B occurred, the probability that A occurred is the fraction of the area of B which is occupied by A: P(A|B) = P(A∩B) / P(B)
Intuition:
The probability of an event is the expected fraction of time that the outcome would occur with repeated experiments.
Building machine learning models
Maximum Likelihood Estimation
Given a probability model with some vector of parameters θ (Theta) and observed data D, the best fitting model is the one that maximizes the probability P_θ(D)
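A small sketch of maximum likelihood, assuming a hypothetical coin-flip model (10 flips, 7 heads) and a grid search over p (NumPy assumed):

```python
import numpy as np

# Hypothetical data D: 10 coin flips with 7 heads; model parameter p = P(heads)
heads, tails = 7, 3

def log_likelihood(p):
    return heads * np.log(p) + tails * np.log(1 - p)

ps = np.linspace(0.01, 0.99, 99)
p_hat = ps[np.argmax(log_likelihood(ps))]
print(p_hat)                 # ~0.7: the fraction of heads maximizes P_p(D)
```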
Bayes’ rule
can be leveraged to understand competing hypotheses
odds are a ratio of two probabilities, e.g. 2/1
Posterior odds = (ratio of the probabilities of generating the data under each hypothesis) × prior odds
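A tiny sketch of the odds update, with made-up prior odds and data probabilities for two hypotheses H1 and H2:

```python
# Hypothetical numbers for two competing hypotheses H1, H2
prior_odds = 2 / 1            # we think H1 is twice as likely as H2 a priori
p_data_given_h1 = 0.1         # probability of the observed data under H1
p_data_given_h2 = 0.4         # probability of the observed data under H2

likelihood_ratio = p_data_given_h1 / p_data_given_h2
posterior_odds = likelihood_ratio * prior_odds    # posterior odds = data ratio * prior odds
print(posterior_odds)                             # 0.5: the data shift belief toward H2
```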
Independence
Two events are independent if one event doesn't influence the other
A and B are independent if P{A∩B} = P{A}·P{B}
Chebyshev’s inequality
For any random variable X (no assumptions), P(|X − E[X]| ≥ k·σ) ≤ 1/k²
e.g., X falls within 10 standard deviations of its mean at least 99% of the time
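An empirical sketch of Chebyshev's inequality, assuming NumPy and an arbitrary (here exponential) example distribution:

```python
import numpy as np

# Empirical check of Chebyshev's inequality on a non-Gaussian distribution
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)
mu, sigma, k = x.mean(), x.std(), 10

within = np.mean(np.abs(x - mu) < k * sigma)
print(within)          # at least 1 - 1/k**2 = 0.99, regardless of the distribution
```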
The Gaussian curve
Key Properties
Central limit theorem
is a statistical theory stating that, given a sufficiently large sample size from a population with a finite level of variance, the mean of the samples will be approximately normally distributed around the population mean
Maximum entropy distribution
Amongst all continuous RVs with E[X] = 0 and Var[X] = 1, the entropy H(X) is maximized uniquely for X ~ N(0,1)
The Gaussian is the most random RV with fixed mean and variance
General Gaussian Density: f(x) = exp(−(x − μ)² / (2σ²)) / (σ·√(2π))
Standard Gaussian (Normal Distribution) Density: f(x) = exp(−x²/2) / √(2π)
E[X]=0
Var[X]=1
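A minimal sketch of the Gaussian density, assuming NumPy; the made-up grid checks that it integrates to about 1:

```python
import numpy as np

def gaussian_density(x, mu=0.0, sigma=1.0):
    """General Gaussian density; mu=0, sigma=1 gives the standard normal."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-10, 10, 10_001)
dx = xs[1] - xs[0]
print(np.sum(gaussian_density(xs)) * dx)        # ~1.0: total probability integrates to one
```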
Random variables
is a function X that takes in an outcome and gives a number back
A discrete X takes at most countably many values, usually only a finite set of values
Expected Value
mean
Variance
how far samples typically are from the mean: Var[X] = E[(X − E[X])²]
Standard Deviation: σ = √Var[X]
Entropy
Entropy (H)
The Only Choice was the Units
first you need to choose the base for the logarithm
If the base is not 2, then the entropy should be divided by log₂ of the base
Examples
One coin
Entropy = one bit of randomness
H (1/2)
T (1/2)
Two coins
Entropy = 2 bits of randomness
H
HH (1/4)
HT (1/4)
T
TH (1/4)
TT (1/4)
A mixed case
Entropy = 1.5 bits of randomness = 1/2·(1 bit) + 1/2·(2 bits)
H (1/2)
T
TH (1/4)
TT (1/4)
Examine the Trees
if we flip n coins, then P=1/2^n
# coin flips = -log2(P)
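A small sketch computing H = −Σ p·log₂(p) for the coin examples above (NumPy assumed):

```python
import numpy as np

def entropy_bits(probs):
    """H = -sum(p * log2(p)) over outcomes with non-zero probability."""
    p = np.array([q for q in probs if q > 0])
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([0.5, 0.5]))                 # one coin: 1 bit
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # two coins: 2 bits
print(entropy_bits([0.5, 0.25, 0.25]))          # the mixed case: 1.5 bits
```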
Continuous random variables
For many applications ML works with continuous random variables (measurements with real numbers).
Probability density function