Math for Machine Learning
Vectors
:arrow_right:
Geometry of Column Vectors.
Operations
Scalar Multiplication
Addition as Displacement
Subtraction as Mapping
Vectors as Directions
Measures of Magnitude
Norms - are measures of distance
Norm Properties
All distances are non-negative
Distances scale with scalar multiplication: ||c*v|| = |c| * ||v||
If I travel from A to B and then from B to C, that is at least as far as going directly from A to C (triangle inequality): ||v + w|| <= ||v|| + ||w||
Types of Norms
Euclidean Norm (L2-Norm)
Lp-Norm
L1-Norm
L∞-Norm
L0-Norm
Despite the name, it's not a norm
It counts the number of non-zero elements of a vector
Geometry of Norms
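A minimal sketch (assuming Python with NumPy, not part of the original notes) of the norms above on a concrete vector:

import numpy as np

v = np.array([3.0, -4.0, 0.0])

l2 = np.sqrt(np.sum(v**2))        # Euclidean / L2 norm: 5.0
l1 = np.sum(np.abs(v))            # L1 norm: 7.0
linf = np.max(np.abs(v))          # L-infinity norm: 4.0
l0 = np.count_nonzero(v)          # "L0": number of non-zero entries (not a true norm)

# Same values via NumPy's built-in norm routine
assert np.isclose(l2, np.linalg.norm(v, 2))
assert np.isclose(l1, np.linalg.norm(v, 1))
assert np.isclose(linf, np.linalg.norm(v, np.inf))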
Matrices
:bookmark_tabs:
Dot product and how to extract angles
Dot Product - the sum of the products of the corresponding entries of the two vectors: v*w = v1*w1 + v2*w2 + ... + vn*wn
Angles: cos(theta) = v*w / (||v|| * ||w||)
Key Consequences
Orthogonality
v*w=0
v*w > 0 when the angle between v and w is less than 90°; v*w < 0 when it is greater than 90°
Hyperplane - the set of vectors orthogonal to a given vector
a perpendicular line in 2D
a perpendicular plane in 3D
Decision plane (illustrated in 1D, 2D, and 3D)
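A small sketch (Python/NumPy assumed; the vectors are illustrative) of the dot product, the angle it yields, and the orthogonality check:

import numpy as np

v = np.array([1.0, 2.0])
w = np.array([2.0, -1.0])

dot = np.dot(v, w)                                  # sum of products of entries -> 0.0
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))    # angle in radians

print(dot, np.degrees(theta))   # 0.0, 90.0 -> v and w are orthogonal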
Matrix multiplication and examples
If A is a matrix where the rows are features wi and B is a matrix where the columns are data vectors vj, then the i,j-th entry of the product is wi*vj, which is to say the i-th feature of the j-th vector.
In formulae: if C = AB, where A is an n x m matrix and B is an m x k matrix, then C is an n x k matrix where C_ij = sum_l A_il * B_lj.
Well defined only when the number of columns of A equals the number of rows of B.
Hadamard product
An (often less useful) method of multiplying matrices element-wise: (AoB)_ij = A_ij * B_ij
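A short sketch (Python/NumPy assumed; the matrices are illustrative) contrasting the matrix product with the element-wise Hadamard product:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

C = A @ B          # matrix product: C[i, j] = sum_l A[i, l] * B[l, j]
H = A * B          # Hadamard product: element-wise multiplication

print(C)           # [[2. 1.], [4. 3.]]
print(H)           # [[0. 2.], [3. 0.]]
print(np.allclose(A @ B, B @ A))   # False: matrix product is not commutative
print(np.allclose(A * B, B * A))   # True: Hadamard product is commutative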
Matrix product properties
Matrix Products
Distributivity
A(B+C) = AB + AC
Associativity
A(BC)=(AB)C
Not commutative
AB != BA (in general)
The Identity Matrix
IA=A
All ones on the diagonal, zeros elsewhere
Properties of the Hadamard Product
Distributivity
Ao(B+C) = AoB + AoC
Associativity
Ao(BoC) = (AoB)oC
Commutativity
AoB = BoA
Geometry of matrix operations
Intuition from Two Dimensions
Suppose A is a 2x2 matrix (mapping R^2 to itself). Any such matrix can be expressed uniquely as a stretching, followed by a skewing, followed by a rotation.
Any vector can be written as a sum of scalar multiples of two specific vectors
A applied to any vector then follows from A applied to those two vectors
The Determinant
det(A) is the factor the area is multiplied by
det(A) is negative if it flips the plane over
Determinant computation
The Two-by-two
det(A) = ad - bc, for A = [[a, b], [c, d]]
Larger Matrices
expanding along a row reduces it to m determinants of (m-1)x(m-1) matrices
computers do it more simply, in O(m^3) time
using methods called matrix factorizations
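A quick check (Python/NumPy assumed; the matrix is illustrative) of the two-by-two formula against NumPy's general determinant routine:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
a, b, c, d = A.ravel()

det_by_hand = a * d - b * c          # 2*3 - 1*1 = 5
det_numpy = np.linalg.det(A)         # uses an O(m^3) factorization internally

print(det_by_hand, det_numpy)        # 5.0  ~5.0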
Matrix invertibility
When can you invert?
It can be done only when det(A) != 0
How to Compute the Inverse
A^(-1)*A=I
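A minimal sketch (Python/NumPy assumed; the matrix is illustrative): invert only when the determinant is non-zero, then verify A^(-1)*A = I:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

if not np.isclose(np.linalg.det(A), 0.0):
    A_inv = np.linalg.inv(A)
    print(np.allclose(A_inv @ A, np.eye(2)))   # True: A^(-1) A = I
else:
    print("det(A) = 0, so A is not invertible")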
Linear dependence
Definition
the vectors lie in a lower-dimensional space
vectors v1, ..., vk are linearly dependent if there are some a_i, not all zero, such that
a1*v1 + a2*v2 + ... + ak*vk = 0
Example
a1=1, a2=-2, a3=-1
det(A) = 0 if and only if the columns of A are linearly dependent
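A small illustration (Python/NumPy assumed; the vectors are hypothetical, chosen so that the coefficients a1=1, a2=-2, a3=-1 give a dependence) that linearly dependent columns force det(A) = 0:

import numpy as np

v1 = np.array([1.0, 0.0, 2.0])
v2 = np.array([0.0, 1.0, 1.0])
v3 = v1 - 2 * v2                  # so 1*v1 + (-2)*v2 + (-1)*v3 = 0

A = np.column_stack([v1, v2, v3]) # columns are linearly dependent
print(np.linalg.det(A))           # ~0.0
print(v1 - 2 * v2 - v3)           # [0. 0. 0.]: the dependence relation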
Probability
:game_die:
Axioms of probability
Something always happens.
If two events can't happen at the same time (disjoint events), then the fraction of the time that at least one of them occurs is the sum of the fractions of the time each occurs separately.
The fraction of the times an event occurs is between 0 and 1.
Visualizing Probability
using a Venn diagram
General Picture
Sample Space <-> Region
Outcomes <-> Points
Events <-> Subregion
Disjoint events <-> Disjoint subregions
Probability <-> Area of subregion
Inclusion/Exclusion: P(A∪B) = P(A) + P(B) - P(A∩B)
Intersection
of two sets
Union
of two sets
Symmetric difference
of two sets
Relative complement
of A (left) in B (right)
Absolute complement
of A in U
Conditional probability
If I know B occurred, the probability that A occurred is the fraction of the area of B which is occupied by A: P(A|B) = P(A∩B) / P(B)
Bayes’ rule
can be leveraged to understand competing hypotheses
odds are a ratio of two probabilities, e.g. 2/1
Posterior odds = (ratio of the probabilities of generating the observed data under each hypothesis) * prior odds
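A tiny numeric sketch (Python assumed; the prior odds and likelihoods are hypothetical numbers) of updating the odds between two competing hypotheses:

# Two competing hypotheses H1 and H2 with hypothetical prior odds 2:1
prior_odds = 2.0 / 1.0

# Hypothetical likelihoods of the observed data under each hypothesis
p_data_given_h1 = 0.3
p_data_given_h2 = 0.6

# Posterior odds = (ratio of probabilities of generating the data) * prior odds
posterior_odds = (p_data_given_h1 / p_data_given_h2) * prior_odds
print(posterior_odds)   # 1.0 -> the data moved us from 2:1 towards even odds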
Independence
Two events are independent if one event doesn't influence the other
A and B are independent if P{A∩B} = P{A} * P{B}
Random variables
is a function X that takes in an outcome and gives a number back
A discrete X takes at most countably many values, usually only a finite set of values
Expected Value
mean: E[X] = sum over values x of x * P{X = x}
Variance
how close to the mean the samples are: Var[X] = E[(X - E[X])^2]
Standard Deviation
sigma = sqrt(Var[X])
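A short sketch (Python/NumPy assumed; the values and probabilities are illustrative) computing these quantities for a discrete random variable:

import numpy as np

values = np.array([0.0, 1.0, 2.0])     # values X can take
probs = np.array([0.25, 0.5, 0.25])    # P{X = value}, must sum to 1

mean = np.sum(values * probs)                    # E[X] = 1.0
variance = np.sum((values - mean)**2 * probs)    # Var[X] = E[(X - E[X])^2] = 0.5
std = np.sqrt(variance)                          # standard deviation

print(mean, variance, std)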
Chebyshev’s inequality
For any random variable X (no assumptions needed), P{|X - E[X]| >= k*sigma} <= 1/k^2; for example, X is within 10 standard deviations of its mean at least 99% of the time.
Entropy
Examples
One coin: outcomes H (1/2), T (1/2); Entropy = one bit of randomness
Two coins: outcomes HH (1/4), HT (1/4), TH (1/4), TT (1/4); Entropy = 2 bits of randomness
A mixed case: outcomes H (1/2), TH (1/4), TT (1/4); Entropy = 1.5 bits of randomness = 1/2*(1 bit) + 1/2*(2 bits)
Examine the Trees
if we flip n coins, then P=1/2^n
# coin flips = -log2(P)
Entropy (H) = -sum_i p_i * log2(p_i), the expected number of coin flips (bits) needed to describe the outcome
The Only Choice was the Units
First you need to choose the base for the logarithm.
If the base is not 2, the resulting entropy equals the entropy in bits divided by log2 of the base.
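A minimal sketch (Python/NumPy assumed) of the entropy formula, reproducing the coin examples above:

import numpy as np

def entropy_bits(probs):
    """Entropy H = -sum(p * log2(p)), ignoring zero-probability outcomes."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))                # one coin: 1.0 bit
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # two coins: 2.0 bits
print(entropy_bits([0.5, 0.25, 0.25]))         # mixed case: 1.5 bits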
Continuous random variables
For many applications, ML works with continuous random variables (measurements with real numbers).
Probability density function
The Gaussian curve
Standard Gaussian (Normal Distribution) Density
f(x) = (1/sqrt(2*pi)) * exp(-x^2/2)
E[X] = 0
Var[X] = 1
General Gaussian Density: f(x) = (1/sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2)), with E[X] = mu, Var[X] = sigma^2
Key Properties
Maximum entropy distribution
Amongst all continuous RVs with E[X] = 0 and Var[X] = 1, the entropy H(X) is maximized uniquely for X ~ N(0,1)
The Gaussian is the most random RV with fixed mean and variance
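A small sketch (Python/NumPy assumed) of the standard and general Gaussian densities above:

import numpy as np

def gaussian_density(x, mu=0.0, sigma=1.0):
    """General Gaussian density; mu=0, sigma=1 gives the standard Gaussian."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

print(gaussian_density(0.0))                     # ~0.3989, peak of the standard Gaussian
print(gaussian_density(1.0, mu=1.0, sigma=2.0))  # peak of N(1, 4): ~0.1995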
Central limit theorem
is a statistical theory stating that, given a sufficiently large sample size from a population with a finite variance, the mean of the samples will be approximately equal to the mean of the population, and the distribution of the sample mean approaches a normal (Gaussian) distribution.
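A quick simulation sketch (Python/NumPy assumed; the uniform population and sizes are illustrative): sample means of a decidedly non-Gaussian distribution still concentrate around the population mean and look approximately normal:

import numpy as np

rng = np.random.default_rng(0)

# Uniform(0, 1) population: mean 0.5, variance 1/12 (finite)
samples = rng.uniform(0.0, 1.0, size=(10_000, 100))   # 10,000 samples of size 100
sample_means = samples.mean(axis=1)

print(sample_means.mean())   # ~0.5, close to the population mean
print(sample_means.std())    # ~sqrt((1/12)/100) ~ 0.029: the means concentrate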
Building machine learning models
Maximum Likelihood Estimation
Given a probability model with some vector of parameters (Theta) and observed data D, the best fitting model is the one that maximizes the probability of that data.
Intuition: The probability of an event is the expected fraction of time that the outcome would occur with repeated experiments.
Terminology
Outcome
A single possibility from the experiment
Sample Space
The set of all possible outcomes
Capital Omega
Event
Something you can observe with a yes/no answer
Capital E
Probability
Fraction of an experiment where an event occurs
P{E} ∈ [0,1]
Univariate Derivatives
:silhouette:
Gradient Descent
Goal
: Minimize f(x)
ALGORITHM
1. Start with a guess x0
2. Iterate: x_{n+1} = x_n - eta * f'(x_n), where eta is the learning rate
3. Stop after some condition is met
if the value of x doesn't change more than 0.001
a fixed number of steps
fancier things TBD
As simplistic as this is, almost all machine learning you have heard of uses some version of this in the learning process
Issues: how to pick eta
Recall that an improperly chosen learning rate will cause the entire optimization procedure to either fail or operate too slowly to be of practical use.
Sometimes we can circumvent this issue.
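A minimal sketch (Python assumed; the objective f(x) = (x - 3)^2 is illustrative) of the algorithm above, using the stopping rule "x changes by less than 0.001":

def gradient_descent(f_prime, x0, eta=0.1, tol=1e-3, max_steps=10_000):
    """Gradient descent: iterate x <- x - eta * f'(x) until x barely moves."""
    x = x0
    for _ in range(max_steps):
        x_new = x - eta * f_prime(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Minimize f(x) = (x - 3)^2, whose derivative is 2 * (x - 3)
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # ~3.0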
Maximum Likelihood Estimation
find p such that P_p(D) is maximized
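A tiny sketch (Python/NumPy assumed; the coin-flip data are hypothetical) of maximizing P_p(D) over a grid of candidate values of p:

import numpy as np

# Hypothetical data D: 7 heads out of 10 flips of a coin with unknown p
heads, flips = 7, 10

p_grid = np.linspace(0.01, 0.99, 99)
likelihood = p_grid**heads * (1 - p_grid)**(flips - heads)   # P_p(D), ignoring order

p_hat = p_grid[np.argmax(likelihood)]
print(p_hat)   # ~0.7: the fraction of heads maximizes the likelihood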
Derivative
let's approximate
better approximation: f(x + eps) ≈ f(x) + f'(x) * eps
Can be presented as: f'(x) = lim_{eps -> 0} (f(x + eps) - f(x)) / eps
Most usable form
http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html
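A short numeric check (Python assumed; f(x) = x^3 is illustrative) of the approximation f(x + eps) ≈ f(x) + f'(x) * eps:

def f(x):
    return x**3

def f_prime(x):
    return 3 * x**2

x, eps = 2.0, 1e-4
exact = f(x + eps)
approx = f(x) + f_prime(x) * eps
print(exact, approx, abs(exact - approx))   # the error shrinks like eps^2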
Rules
Sum Rule: (f + g)' = f' + g'
Product Rule: (f*g)' = f'*g + f*g'
Chain Rule: (f(g(x)))' = f'(g(x)) * g'(x)
Alternative
Quotient Rule: (f/g)' = (f'*g - f*g') / g^2
Interpretation
Second Derivative
f''(x) shows how the slope is changing
max -> f''(x) < 0
min -> f''(x) > 0
Can't tell -> f''(x) = 0, proceed with higher derivatives
Newton's method
Idea
Minimizing f <---> f '(x)=0
Now look for an algorithm to find the zero of some function g(x)
Apply this algorithm to f '(x)
Pictorially: follow the tangent line of g down to where it crosses zero to find x such that g(x) = 0
Computing the Line
the tangent line passes through (x0, g(x0)) with slope g'(x0)
y = g'(x0) * (x - x0) + g(x0)
solve the equation y = 0
Update Step for Zero Finding
we want to find where g(x) = 0, and we start with some initial guess x0 and then iterate x_{n+1} = x_n - g(x_n) / g'(x_n)
Update Step for Minimization
To minimize f, we want to find where f'(x) = 0, and thus we may start with some initial guess x0 and then iterate Newton's Method on f' to get x_{n+1} = x_n - f'(x_n) / f''(x_n)
Relationship to Gradient Descent
It behaves like gradient descent with a learning rate that adapts to f: eta = 1 / f''(x_n)
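A minimal sketch (Python assumed; f(x) = x^2 - ln(x) is an illustrative objective with minimizer 1/sqrt(2)) of the minimization update x <- x - f'(x)/f''(x):

import math

def f_prime(x):
    return 2 * x - 1 / x          # derivative of f(x) = x^2 - ln(x)

def f_double_prime(x):
    return 2 + 1 / x**2

x = 2.0                           # initial guess x0
for _ in range(10):
    x = x - f_prime(x) / f_double_prime(x)   # Newton step applied to f'

print(x, 1 / math.sqrt(2))        # both ~0.7071: the minimizer of f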
Multivariate Derivatives
:silhouettes:
If you have a function f(x) where x is an n-dimensional vector, then it's a function of many variables.
You need to know how the function responds to changes in all of them.
The majority of this will be just bookkeeping, but it will be terribly messy bookkeeping.
Partial Derivatives
is a measure of the rate of change of the function... when one of the variables is subjected to a small change but the others are kept constant.
Example:
The Gradient
Definitions
Vector derivative
The Gradient: the collection of partial derivatives, ∇f = (∂f/∂x1, ..., ∂f/∂xn)
Matrix derivative
Visualize
Key Properties
∇f = 0 at local max & min
∇f points in the direction of maximum increase
-∇f points in the direction of maximum decrease
Gradient Descent
Level set
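A small sketch (Python/NumPy assumed; f(x, y) = x^2 + 2y^2 is an illustrative function) of the gradient as the collection of partial derivatives, used for multivariate gradient descent:

import numpy as np

def grad_f(v):
    """Gradient of f(x, y) = x^2 + 2*y^2: the vector of partial derivatives."""
    x, y = v
    return np.array([2 * x, 4 * y])

v = np.array([3.0, -2.0])
eta = 0.1
for _ in range(200):
    v = v - eta * grad_f(v)       # step in the direction of maximum decrease

print(v)   # ~[0, 0], the unique minimum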
Matrix Calculus
Second Derivative
For a function of many variables, there are many second derivatives (one for each pair of variables)
Hessian
Hf is the matrix of all second partial derivatives, (Hf)_ij = ∂²f/∂x_i∂x_j
2D intuition: example Hessians Hf for surfaces that curve up, curve down, and are saddle-shaped
Critical points: ∇f = 0
det(Hf) < 0
saddle point
det(Hf) > 0
further investigation: check the trace
tr(Hf) < 0
local maximum
tr(Hf) > 0
local minimum
tr(Hf) = 0
does not happen (when det(Hf) > 0)
det(Hf) = 0
unclear
need more info
If the matrix is diagonal, a positive entry is a direction where it curves up, and a negative entry is a direction where it curves down
Trace
sum of diagonal terms tr(Hf)
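A short sketch (Python/NumPy assumed) of the 2D det/trace test for a constant Hessian; the illustrative function f(x, y) = x^2 - y^2 has a saddle at the origin:

import numpy as np

# Hessian of f(x, y) = x^2 - y^2 (constant in this simple example)
Hf = np.array([[2.0, 0.0],
               [0.0, -2.0]])

det, tr = np.linalg.det(Hf), np.trace(Hf)
if det < 0:
    print("saddle point")                 # this case: det = -4 < 0
elif det > 0:
    print("local max" if tr < 0 else "local min")
else:
    print("unclear, need more info")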
Newton Method
The computational complexity of inverting an n x n matrix is not actually known, but the best-known algorithm is O(n^2.373).
For high-dimensional data sets, anything past linear time in the dimensions is often impractical, so Newton's Method is reserved for a few hundred dimensions at most.
Convexity
a function is convex if the line segment between any two points on its graph stays above the graph
The opposite is concave
Benefits
When a function is convex then there is
a single unique local minimum
no maxima
no saddle points
Gradient descent
is guaranteed to find the global minimum with a small enough learning rate eta
Newton's Method always works
Derivative Condition
Hf is called positive semi-definite if v^T (Hf) v >= 0 for every vector v
And this implies f is convex
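A minimal sketch (Python/NumPy assumed; the convex function f(x, y) = x^2 + 2y^2 is illustrative): for a constant Hessian, positive semi-definiteness can be checked via its eigenvalues (all >= 0 means v^T (Hf) v >= 0 for every v):

import numpy as np

# Hessian of the convex function f(x, y) = x^2 + 2*y^2
Hf = np.array([[2.0, 0.0],
               [0.0, 4.0]])

eigenvalues = np.linalg.eigvalsh(Hf)           # Hf is symmetric, so eigvalsh applies
print(eigenvalues, np.all(eigenvalues >= 0))   # [2. 4.] True -> positive semi-definite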
A Warning
When going to more complex models, e.g. neural networks, there are many local minima & many saddle points.
So they are not convex.