Formal setup
Standard setup for supervised learning
Loss function
l : X × Y × Y → R≥0 measures how “expensive” an error is.
0-1 loss for classification: l(x, y, y') = 1 if y ≠ y', and 0 otherwise.
Squared loss for regression: l(x, y, y') = (y − y')².
Note: the choice of the loss function influences the inductive bias.
The loss can also depend on the input x or on the order of y and y' (the type of error).
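The two loss functions above can be written directly in Python (a minimal sketch; the function names are my own):

```python
def zero_one_loss(y, y_pred):
    # 0-1 loss for classification: 1 if the prediction is wrong, 0 otherwise
    return 0 if y == y_pred else 1

def squared_loss(y, y_pred):
    # squared loss for regression: l(x, y, y') = (y - y')^2
    return (y - y_pred) ** 2

print(zero_one_loss(1, 0))     # 1
print(squared_loss(2.0, 0.5))  # 2.25
```

Note that neither loss here uses the input x; a cost-sensitive loss that treats the two error types differently would also inspect y and y_pred separately.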
True Risk
The true risk (or true expected loss) of a prediction function f : X → Y (with respect to loss function l) is defined as
R(f) := E(l(X, Y, f(X)))
The goal of learning is to use the training data to construct a function f_n whose true risk is as small as possible, i.e., close to the Bayes risk: R(f_n) ≈ R^*.
A classifier / prediction function is simply a function f : X → Y.
Bayes risk and Bayes classifier
The Bayes risk is defined as
R^* := inf{ R(f) | f : X → Y, f measurable }
f^* := argmin_f R(f) is called the Bayes classifier / Bayes predictor. (If it exists, it is not necessarily unique.)
The underlying space
Probability distribution P on the product space X × Y
no assumption on the form of the probability distribution
both input variables and output variables (!) are random quantities
Input space X, output space Y
Sometimes, the spaces X or Y have some mathematical structure (topology, metric, vector space, etc), or we try to construct such a structure.
We assume that each space is endowed with a σ-algebra, so that we can define a probability measure on it. We ignore this issue in the following (for real-world machine learning it is not an issue).
Consistency of a learning algorithm
We say that the algorithm A is consistent (for probability distribution P) if the risk R(f_n) of its selected function f_n converges in probability to the Bayes risk R^* as n → ∞.
If the convergence holds almost surely, the algorithm is called strongly consistent.
We say that algorithm A is
universally consistent
if it is consistent for all possible probability distributions P over X ×Y.
Ultimately, we want to find learning algorithms that are universally consistent: no matter what the underlying probability distribution is, once we have seen “enough data points”, the true risk of our learning rule f_n will be arbitrarily close to the best possible risk.
Examples of universally consistent algorithms: the kNN classifier, support vector machines, boosting, random forests, and many more.
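A small simulation can illustrate consistency for kNN. The setup below is hypothetical: X ~ Uniform[0, 1], Y = 1{X > 0.5} with labels flipped with probability 0.1, so the Bayes risk is R^* = 0.1; as n grows (with k ≈ √n), the estimated risk of kNN should approach 0.1:

```python
import random

def draw(n, rng):
    # sample n pairs from the hypothetical distribution described above
    xs = [rng.random() for _ in range(n)]
    ys = [(1 if x > 0.5 else 0) ^ (1 if rng.random() < 0.1 else 0) for x in xs]
    return list(zip(xs, ys))

def knn_predict(train, x, k):
    # majority vote among the k nearest training points (1-D distance)
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return 1 if sum(y for _, y in nearest) * 2 > k else 0

rng = random.Random(0)
test = draw(2000, rng)  # large held-out sample to estimate the true risk
risks = []
for n in (10, 100, 1000):
    train = draw(n, rng)
    k = max(1, int(n ** 0.5))
    risk = sum(knn_predict(train, x, k) != y for x, y in test) / len(test)
    risks.append(risk)
    print(n, round(risk, 3))
```

The printed risks are only estimates on one random sample, so they need not decrease monotonically, but for large n they should sit close to the Bayes risk 0.1.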
Optimal prediction functions in closed form
for classification under 0-1 loss: f^*(x) = argmax_y P(Y = y | X = x)
for regression under squared (L_2) loss: f^*(x) = E(Y | X = x)
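For a toy discrete distribution (the joint probabilities below are hypothetical), both closed-form optimal predictors — the most probable label under 0-1 loss, the conditional expectation under squared loss — can be read off directly:

```python
# P[(x, y)] = joint probability of (X = x, Y = y), with x, y in {0, 1}
P = {
    (0, 0): 0.3, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.4,
}

def posterior(x):
    # conditional distribution P(Y = y | X = x)
    px = sum(p for (xx, _), p in P.items() if xx == x)
    return {y: P[(x, y)] / px for y in (0, 1)}

def bayes_classifier(x):
    # under 0-1 loss: predict the most probable label given x
    post = posterior(x)
    return max(post, key=post.get)

def bayes_regressor(x):
    # under squared loss: predict the conditional expectation E(Y | X = x)
    post = posterior(x)
    return sum(y * p for y, p in post.items())

print(bayes_classifier(0), bayes_classifier(1))  # 0 1
print(bayes_regressor(1))                        # ≈ 0.667
```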
Basic learning principles: ERM, RRM
Empirical risk minimization (ERM)
Overfitting vs Underfitting
Overfitting happens if F is too large. Then we have a high estimation error but a small approximation error.
Underfitting happens if F is too small. In this case we have a small estimation error but a large approximation error.
Empirical risk minimization approach
Define a set F of functions from X → Y.
Within these functions, choose one that has the smallest empirical risk:
R_n(f) := (1/n) · Σ_{i=1}^n l(X_i, Y_i, f(X_i))
Denote by f~ the best function in the set F.
Approximation error: R(f~) − R(f^*). It is a deterministic quantity that does not depend on the sample, only on the choice of the space F.
Estimation error: R(f_n) − R(f~). It is a random variable that depends on the random sample.
As we don’t know P, we cannot compute the true risk. But we can compute the empirical risk based on a sample (X_i, Y_i), i = 1, ..., n.
Remark
The key to the success / failure of ERM is to choose a “good” function class F
From the computational side, ERM is not always easy: depending on the function class and the loss function, the problem can be quite challenging (finding the minimizer of the 0-1 loss is often NP-hard). This is why in practice one uses convex relaxations of the 0-1 loss function.
From a conceptual/theoretical side, ERM is a straightforward learning principle.
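A minimal ERM sketch (the setup and names are my own): take F to be the threshold classifiers f_t(x) = 1{x ≥ t} on the real line, and pick the threshold with the smallest empirical 0-1 risk on the sample:

```python
def empirical_risk(t, sample):
    # average 0-1 loss of the threshold classifier f_t on the sample
    return sum((1 if x >= t else 0) != y for x, y in sample) / len(sample)

def erm_threshold(sample):
    # only thresholds at the sample points (plus -inf) can matter
    candidates = [float("-inf")] + [x for x, _ in sample]
    return min(candidates, key=lambda t: empirical_risk(t, sample))

sample = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
t_hat = erm_threshold(sample)
print(t_hat, empirical_risk(t_hat, sample))  # 0.6 0.0
```

The exhaustive search over candidate thresholds is feasible only because this F is so simple; for richer classes and the 0-1 loss, exact minimization is exactly the NP-hard problem mentioned above.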
Regularized risk minimization (RRM)
Intuition
If I can fit the data reasonably well with a “simple function”, then choose such a simple function.
If all simple functions lead to a very high empirical risk, then better choose a more complex function.
Alternative approach to ERM (where choosing F is hard)
Let F be a very large space of functions.
Define a regularizer Ω : F → R≥0 that measures how “complex” a function is. Examples: F = polynomials, Ω(f) = degree of the polynomial f; F = differentiable functions, Ω(f) = maximal slope.
Define the regularized risk R_reg_n(f) := R_n(f) + λ · Ω(f), where λ > 0 is called the regularization constant.
Then choose f ∈ F to minimize the regularized risk.
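A minimal RRM sketch in one dimension (the setup is hypothetical): F = linear functions f_w(x) = w·x, empirical risk = mean squared loss, regularizer Ω(f_w) = w², so we minimize (1/n) Σ (y_i − w·x_i)² + λ·w², which has a closed-form minimizer:

```python
def rrm_linear(sample, lam):
    # minimize (1/n) * sum (y_i - w * x_i)^2 + lam * w^2 over w;
    # setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + n*lam)
    n = len(sample)
    sxy = sum(x * y for x, y in sample)
    sxx = sum(x * x for x, _ in sample)
    return sxy / (sxx + n * lam)

sample = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
print(rrm_linear(sample, 0.0))  # plain ERM (least squares), close to 2
print(rrm_linear(sample, 1.0))  # shrunk towards 0 by the regularizer
```

With λ = 0 this reduces to ERM; increasing λ trades empirical fit for a “simpler” (smaller-slope) function, exactly the intuition stated above.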
Statistical and Bayesian decision theory
What if we know all the underlying quantities? Given known P(X|Y) and P(Y), and an observation x, how should we decide on its label y?
Maximum likelihood principle: decide based on the likelihoods (which of P(x|y_1) and P(x|y_2) is bigger?)
Bayesian a posteriori criterion: decide based on the posterior distribution P(Y|X)
Just look at the priors (always predict the label of the “larger class”)
Also take the costs of different errors into account.
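The first two decision rules can disagree when the priors are unbalanced. A sketch with two classes and hypothetical numbers (unit-variance Gaussian likelihoods, priors 0.8 / 0.2):

```python
from math import exp, sqrt, pi

prior = {1: 0.8, 2: 0.2}  # P(Y = y): class 1 is the "larger class"

def likelihood(x, y):
    # P(x | Y = y): Gaussian density with class-dependent mean, unit variance
    mean = {1: 0.0, 2: 3.0}[y]
    return exp(-0.5 * (x - mean) ** 2) / sqrt(2 * pi)

def max_likelihood(x):
    # compare P(x | y_1) and P(x | y_2), ignoring the priors
    return max(prior, key=lambda y: likelihood(x, y))

def max_posterior(x):
    # Bayesian a posteriori: P(Y = y | x) is proportional to P(x | y) * P(y)
    return max(prior, key=lambda y: likelihood(x, y) * prior[y])

x = 1.7  # between the two means, slightly closer to class 2's mean
print(max_likelihood(x), max_posterior(x))  # 2 1
```

Here the likelihood favors class 2, but the large prior on class 1 flips the a posteriori decision; adding error costs would reweight the comparison once more.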