Formal setup
Standard setup for supervised learning
Loss function
l : X × Y × Y → R≥0 measures how “expensive” an error is.
0-1 loss for classification: l(x, y, y') = 1 if y ≠ y', and 0 otherwise.
Squared loss for regression: l(x, y, y') = (y − y')².
Note: the choice of the loss function influences the inductive bias.
The loss can also depend on the input x or on the order of y and y' (the type of error).
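The two loss functions above can be written directly in Python (a minimal sketch; the function names are my own):

```python
def zero_one_loss(y, y_pred):
    # 0-1 loss for classification: 1 if the prediction is wrong, 0 otherwise
    return 0 if y == y_pred else 1

def squared_loss(y, y_pred):
    # squared loss for regression: l(x, y, y') = (y - y')^2
    return (y - y_pred) ** 2

print(zero_one_loss(1, 0))     # 1
print(squared_loss(2.0, 0.5))  # 2.25
```

Note that neither loss here uses the input x; a cost-sensitive loss that treats the two error types differently would also inspect y and y_pred separately.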
True Risk
The true risk (or true expected loss) of a prediction function f : X → Y (with respect to loss function l) is defined as
R(f) := E(l(X, Y, f(X)))
The goal of learning is to use the training data to construct a function f_n whose true risk is as small as possible, i.e., close to the Bayes risk: R(f_n) ≈ R^*.
A classifier / prediction function is simply a function f : X → Y.
Bayes risk and Bayes classifier
The Bayes risk is defined as
R^* := inf{ R(f) | f : X → Y, f measurable }
f^* := argmin_f R(f) is called the Bayes classifier / Bayes predictor. (If it exists, it is not necessarily unique.)
The underlying space
Probability distribution P on the product space X × Y
no assumption on the form of the probability distribution
both input variables and output variables (!) are random quantities
Input space X, output space Y
Sometimes, the spaces X or Y have some mathematical structure (topology, metric, vector space, etc), or we try to construct such a structure.
We assume that each space is endowed with a σ-algebra, so that we can define a probability measure on it. We ignore this issue in the following (for real-world machine learning it is not an issue).
Consistency of a learning algorithm
We say that the algorithm A is consistent (for probability distribution P) if the risk R(f_n) of its selected function f_n converges in probability to the Bayes risk R^* as n → ∞.
If the convergence holds almost surely, the algorithm is called strongly consistent.
We say that algorithm A is
universally consistent
if it is consistent for all possible probability distributions P over X ×Y.
Ultimately, we want to find learning algorithms that are universally consistent: no matter what the underlying probability distribution is, once we have seen “enough data points”, the true risk of our learning rule f_n will be arbitrarily close to the best possible risk.
Examples of universally consistent algorithms: the kNN classifier, support vector machines, boosting, random forests, and many more.
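A small simulation can illustrate consistency for kNN. The setup below is hypothetical: X ~ Uniform[0, 1], Y = 1{X > 0.5} with labels flipped with probability 0.1, so the Bayes risk is R^* = 0.1; as n grows (with k ≈ √n), the estimated risk of kNN should approach 0.1:

```python
import random

def draw(n, rng):
    # sample n pairs from the hypothetical distribution described above
    xs = [rng.random() for _ in range(n)]
    ys = [(1 if x > 0.5 else 0) ^ (1 if rng.random() < 0.1 else 0) for x in xs]
    return list(zip(xs, ys))

def knn_predict(train, x, k):
    # majority vote among the k nearest training points (1-D distance)
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return 1 if sum(y for _, y in nearest) * 2 > k else 0

rng = random.Random(0)
test = draw(2000, rng)  # large held-out sample to estimate the true risk
risks = []
for n in (10, 100, 1000):
    train = draw(n, rng)
    k = max(1, int(n ** 0.5))
    risk = sum(knn_predict(train, x, k) != y for x, y in test) / len(test)
    risks.append(risk)
    print(n, round(risk, 3))
```

The printed risks are only estimates on one random sample, so they need not decrease monotonically, but for large n they should sit close to the Bayes risk 0.1.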
Optimal prediction functions in closed form
for classification under 0-1 loss: f^*(x) = argmax_y P(Y = y | X = x)
for regression under squared (L_2) loss: f^*(x) = E(Y | X = x)
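For a toy discrete distribution (the joint probabilities below are hypothetical), both closed-form optimal predictors — the most probable label under 0-1 loss, the conditional expectation under squared loss — can be read off directly:

```python
# P[(x, y)] = joint probability of (X = x, Y = y), with x, y in {0, 1}
P = {
    (0, 0): 0.3, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.4,
}

def posterior(x):
    # conditional distribution P(Y = y | X = x)
    px = sum(p for (xx, _), p in P.items() if xx == x)
    return {y: P[(x, y)] / px for y in (0, 1)}

def bayes_classifier(x):
    # under 0-1 loss: predict the most probable label given x
    post = posterior(x)
    return max(post, key=post.get)

def bayes_regressor(x):
    # under squared loss: predict the conditional expectation E(Y | X = x)
    post = posterior(x)
    return sum(y * p for y, p in post.items())

print(bayes_classifier(0), bayes_classifier(1))  # 0 1
print(bayes_regressor(1))                        # ≈ 0.667
```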
Basic learning principles: ERM, RRM
Empirical risk minimization (ERM)
Overfitting vs Underfitting
Overfitting happens if F is too large. Then we have a high estimation error but a small approximation error.
Underfitting happens if F is too small. In this case we have a small estimation error but a large approximation error.
Empirical risk minimization approach
Define a set F of functions from X → Y.
Within these functions, choose one that has the smallest empirical risk:
R_n(f) := (1/n) · Σ_{i=1}^n l(X_i, Y_i, f(X_i))
Denote by f~ the best function in the set F.
Approximation error: R(f~) − R(f^*). It is a deterministic quantity that does not depend on the sample, only on the choice of the space F.
Estimation error: R(f_n) − R(f~). It is a random variable that depends on the random sample.
As we don’t know P, we cannot compute the true risk. But we can compute the empirical risk based on a sample (X_i, Y_i), i = 1, ..., n.
Remark
The key to the success / failure of ERM is to choose a “good” function class F
From the computational side, ERM is not always easy: depending on the function class and the loss function, the problem can be quite challenging (finding the minimizer of the 0-1 loss is often NP-hard). This is why in practice one uses convex relaxations of the 0-1 loss function.
From a conceptual/theoretical side, ERM is a straightforward learning principle.
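A minimal ERM sketch (the setup and names are my own): take F to be the threshold classifiers f_t(x) = 1{x ≥ t} on the real line, and pick the threshold with the smallest empirical 0-1 risk on the sample:

```python
def empirical_risk(t, sample):
    # average 0-1 loss of the threshold classifier f_t on the sample
    return sum((1 if x >= t else 0) != y for x, y in sample) / len(sample)

def erm_threshold(sample):
    # only thresholds at the sample points (plus -inf) can matter
    candidates = [float("-inf")] + [x for x, _ in sample]
    return min(candidates, key=lambda t: empirical_risk(t, sample))

sample = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
t_hat = erm_threshold(sample)
print(t_hat, empirical_risk(t_hat, sample))  # 0.6 0.0
```

The exhaustive search over candidate thresholds is feasible only because this F is so simple; for richer classes and the 0-1 loss, exact minimization is exactly the NP-hard problem mentioned above.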
Regularized risk minimization (RRM)
Intuition
If I can fit the data reasonably well with a “simple function”, then choose such a simple function.
If all simple functions lead to a very high empirical risk, then better choose a more complex function.
Alternative approach to ERM (where choosing F is hard)
Let F be a very large space of functions.
Define a regularizer Ω : F → R≥0 that measures how “complex” a function is. Examples: F = polynomials, Ω(f) = degree of the polynomial f; F = differentiable functions, Ω(f) = maximal slope.
Define the regularized risk R_reg_n(f) := R_n(f) + λ · Ω(f), where λ > 0 is called the regularization constant.
Then choose f ∈ F to minimize the regularized risk.
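A minimal RRM sketch in one dimension (the setup is hypothetical): F = linear functions f_w(x) = w·x, empirical risk = mean squared loss, regularizer Ω(f_w) = w², so we minimize (1/n) Σ (y_i − w·x_i)² + λ·w², which has a closed-form minimizer:

```python
def rrm_linear(sample, lam):
    # minimize (1/n) * sum (y_i - w * x_i)^2 + lam * w^2 over w;
    # setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + n*lam)
    n = len(sample)
    sxy = sum(x * y for x, y in sample)
    sxx = sum(x * x for x, _ in sample)
    return sxy / (sxx + n * lam)

sample = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
print(rrm_linear(sample, 0.0))  # plain ERM (least squares), close to 2
print(rrm_linear(sample, 1.0))  # shrunk towards 0 by the regularizer
```

With λ = 0 this reduces to ERM; increasing λ trades empirical fit for a “simpler” (smaller-slope) function, exactly the intuition stated above.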
Statistical and Bayesian decision theory
What if we know all the underlying quantities? Given known P(X|Y) and P(Y), and an observation x, how should we decide on its label y?
Maximum likelihood principle: decide based on the likelihoods (which of P(x|y_1) and P(x|y_2) is bigger?)
Bayesian a posteriori criterion: decide based on the posterior distribution P(Y|X)
Just look at the priors (always predict the label of the “larger class”)
Also take the costs of different errors into account.
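The first two decision rules can disagree when the priors are unbalanced. A sketch with two classes and hypothetical numbers (unit-variance Gaussian likelihoods, priors 0.8 / 0.2):

```python
from math import exp, sqrt, pi

prior = {1: 0.8, 2: 0.2}  # P(Y = y): class 1 is the "larger class"

def likelihood(x, y):
    # P(x | Y = y): Gaussian density with class-dependent mean, unit variance
    mean = {1: 0.0, 2: 3.0}[y]
    return exp(-0.5 * (x - mean) ** 2) / sqrt(2 * pi)

def max_likelihood(x):
    # compare P(x | y_1) and P(x | y_2), ignoring the priors
    return max(prior, key=lambda y: likelihood(x, y))

def max_posterior(x):
    # Bayesian a posteriori: P(Y = y | x) is proportional to P(x | y) * P(y)
    return max(prior, key=lambda y: likelihood(x, y) * prior[y])

x = 1.7  # between the two means, slightly closer to class 2's mean
print(max_likelihood(x), max_posterior(x))  # 2 1
```

Here the likelihood favors class 2, but the large prior on class 1 flips the a posteriori decision; adding error costs would reweight the comparison once more.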