Bishop
Mixture Models
- assumption: the joint distribution \( p(X,Z) \) is easy, the marginal distribution \( p(X) \) is complex
- Problem set

Simple: K-Means
- Intuitively, a cluster is a group of points which minimizes in-group distances
- -> minimize the distortion \( J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert x_n - \mu_k \rVert^2 \)

Basic Algorithm
- Initialize \( \mu_k \)
- Step 1 (Exp): minimize J with respect to \( r_{nk} \) (assign each point to the closest cluster mean)

- Step 2 (Max): minimize J with respect to \( \mu_k \) (update \( \mu_k \) to the mean of the points in its cluster)

- repeat Steps 1-2 until convergence (see the sketch below)

(Figure: from left to right: initialization, Step 1, Step 2)
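A minimal NumPy sketch of the two alternating steps; the function name, the random-data-point initialization, and the stopping rule are illustrative choices, not from the source:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate the assignment step and the mean-update step."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]             # initialise mu_k with K random points
    for _ in range(n_iters):
        # Step 1 (Exp): assign each point to the closest cluster mean (minimise J w.r.t. r_nk)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        r = d2.argmin(axis=1)
        # Step 2 (Max): update each mean to the average of its assigned points (minimise J w.r.t. mu_k)
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):                               # stop once the means no longer move
            break
        mu = new_mu
    return mu, r

# usage sketch
mu, r = kmeans(np.random.randn(200, 2), K=3)
```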
Online Algorithm
- Robbins-Monro procedure
- sequential update of the closest prototype for each new data point: \( \mu_k^{\text{new}} = \mu_k^{\text{old}} + \eta_n (x_n - \mu_k^{\text{old}}) \) (see the sketch after this list)

- why? what is the purpose? can it be stopped early?
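A sketch of the sequential update, assuming a per-cluster decreasing step size \( \eta_n = 1/N_k \) (one common Robbins-Monro choice); function and variable names are illustrative:

```python
import numpy as np

def online_kmeans_update(x, mu, counts):
    """One sequential (Robbins-Monro style) K-means update for a single new point x."""
    k = int(np.argmin(((mu - x) ** 2).sum(axis=1)))   # closest prototype
    counts[k] += 1
    eta = 1.0 / counts[k]                             # decreasing step size eta_n
    mu[k] += eta * (x - mu[k])                        # mu_k_new = mu_k_old + eta_n * (x - mu_k_old)
    return mu, counts

# usage sketch on a stream of points
mu, counts = np.random.randn(3, 2), np.zeros(3)
for x in np.random.randn(100, 2):
    mu, counts = online_kmeans_update(x, mu, counts)
```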
Generalization:

in the M step restrict \( \mu_k \) to be one of the data points and search over all candidates in the cluster, which costs \( \mathcal O(N^2) \) per cluster (K-medoids; see the sketch below)
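A sketch of this restricted M step for one cluster, assuming an arbitrary dissimilarity measure; the default Manhattan distance and the function name are illustrative:

```python
import numpy as np

def medoid_update(X_k, dissim=lambda a, b: np.abs(a - b).sum()):
    """Restricted M step for one cluster: choose the assigned data point that
    minimises the summed dissimilarity to all other points in the cluster (O(N_k^2))."""
    costs = [sum(dissim(x_i, x_j) for x_j in X_k) for x_i in X_k]
    return X_k[int(np.argmin(costs))]

# usage sketch
new_prototype = medoid_update(np.random.randn(50, 2))
```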
Mixture of Gaussians



- \( p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal N(x \mid \mu_k, \Sigma_k) \)
- -> reduces a complicated distribution to a sum of easy distributions
- look at the posterior (responsibility) \( \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{\pi_k\, \mathcal N(x \mid \mu_k, \Sigma_k)}{\sum_j \pi_j\, \mathcal N(x \mid \mu_j, \Sigma_j)} \)

Graphical Model 
Plot of the responsibilities using the true parameters

Naive approach: Maximum Likelihood

- difficult because of the sum inside the logarithm: \( \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k\, \mathcal N(x_n \mid \mu_k, \Sigma_k) \Big\} \) :red_cross:
- gradient with respect to \( \mu_k \): setting it to zero gives \( \mu_k = \frac{1}{N_k} \sum_n \gamma(z_{nk})\, x_n \) with \( N_k = \sum_n \gamma(z_{nk}) \)

- gradient with respect to \( \Sigma_k \): gives \( \Sigma_k = \frac{1}{N_k} \sum_n \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\mathsf T} \)

- gradient with respect to \( \pi \) using Lagrange multipliers (constraint \( \sum_k \pi_k = 1 \)): gives \( \pi_k = \frac{N_k}{N} \)


Problem: the responsibilities \( \gamma(z_{nk}) \) themselves depend on the parameters, so these are not closed-form solutions! :red_cross:
Possible problem: singularities of the likelihood for \( K \ge 2 \) components, when one component collapses onto a single data point (extreme overfitting) :red_cross:

EM for Gaussian Mixture :check:
- choose initial values for the parameters (means, covariances, mixing coefficients)
- E step: evaluate the responsibilities for all data points
- M step: use the responsibilities to update the parameters
- commonly use K-means to initialize the means (see the sketch below)
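A minimal NumPy/SciPy sketch of these steps; the random initialization, the small regularizer added to the covariances, and the function name are illustrative choices, not from the source:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Minimal EM for a Gaussian mixture: E step = responsibilities, M step = parameter updates."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # initialise the means (commonly done with K-means; here simply K random data points)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)       # shared initial covariance
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: evaluate the responsibilities gamma(z_nk) for all data points
        weighted = np.stack(
            [pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)],
            axis=1)                                              # shape (N, K)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: use the responsibilities to update the parameters
        Nk = gamma.sum(axis=0)                                   # effective number of points per component
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, Sigma, gamma

# usage sketch (real data would go here)
pi, mu, Sigma, gamma = em_gmm(np.random.randn(300, 2), K=2)
```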
Visualization 
Gaussian Mixtures revisited
- complete-data likelihood \( p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\, \mathcal N(x_n \mid \mu_k, \Sigma_k)^{z_{nk}} \)
- log likelihood \( \ln p(X, Z \mid \mu, \Sigma, \pi) = \sum_n \sum_k z_{nk} \big\{ \ln \pi_k + \ln \mathcal N(x_n \mid \mu_k, \Sigma_k) \big\} \)

- sum of K independent contributions -> closed-form solution for each class
Graphical Model for complete data

If we do not have complete data, look at the posterior

- -> the posterior factorizes over the \( z_n \) (d-separation)
- expected complete-data log likelihood: \( \mathbb E_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] = \sum_n \sum_k \gamma(z_{nk}) \big\{ \ln \pi_k + \ln \mathcal N(x_n \mid \mu_k, \Sigma_k) \big\} \)

- which now shows that the results we motivated/"derived" for the Gaussian mixture follow from this alternative, more abstract view
Equivalence to K-means as the variance goes to zero
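One way to see this, sketched here under the standard assumption of shared covariances \( \epsilon I \): the responsibilities become the hard assignments of K-means as \( \epsilon \to 0 \),
\[
\gamma(z_{nk}) = \frac{\pi_k \exp\{-\lVert x_n - \mu_k \rVert^2 / 2\epsilon\}}{\sum_j \pi_j \exp\{-\lVert x_n - \mu_j \rVert^2 / 2\epsilon\}}
\;\longrightarrow\;
r_{nk} =
\begin{cases}
1 & \text{if } k = \operatorname{arg\,min}_j \lVert x_n - \mu_j \rVert^2, \\
0 & \text{otherwise,}
\end{cases}
\]
and the expected complete-data log likelihood then reduces (up to constants) to \( -\tfrac{1}{2} J \), the K-means objective.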

General EM
consider the decomposition \( \ln p(X \mid \theta) = \mathcal L(q, \theta) + \mathrm{KL}(q \,\|\, p) \) for any distribution \( q(Z) \)
- this can be rewritten as an ELBO term plus a non-negative KL term (see the worked decomposition below)


-> E step: minimize the KL by setting \( q(Z) = p(Z \mid X, \theta^{\text{old}}) \)
-> the ELBO is now a tight bound
-> M step: maximise \( \mathcal L(q, \theta) \) with respect to \( \theta \)
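Spelled out (a sketch of the standard decomposition, with \( q(Z) \) any distribution over the latent variables):
\[
\ln p(X \mid \theta)
= \underbrace{\sum_Z q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}}_{\mathcal L(q, \theta)\ \text{(ELBO)}}
\;+\;
\underbrace{\Big( - \sum_Z q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)} \Big)}_{\mathrm{KL}(q \,\|\, p)\ \ge\ 0}
\]
Setting \( q(Z) = p(Z \mid X, \theta^{\text{old}}) \) in the E step makes the KL term zero, so \( \mathcal L(q, \theta^{\text{old}}) = \ln p(X \mid \theta^{\text{old}}) \); the M step then increases \( \mathcal L \) (and hence the log likelihood) by updating \( \theta \).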
VAE
\( \mathcal L(q) = \int q(Z) \ln \frac{p(X, Z)}{q(Z)}\, dZ = \int q(Z) \ln p(X \mid Z)\, dZ + \int q(Z) \ln \frac{p(Z)}{q(Z)}\, dZ \)
- the log likelihood \( \ln p(X \mid \theta) = \sum_n \ln p(x_n \mid \theta) \) has a lower bound \( \sum_n \mathcal L_n \) given by a sum of the per-datapoint ELBOs above
- this lower bound needs to be maximised with respect to the parameters
- approximate q and p with neural networks, restrict them for example to the Gaussian family, and demand that the q network (and of course also the p network) is the same for every data point
- the ELBO is an expected value of two terms with respect to \( q(z) \); we approximate this expectation by sampling from \( q(z) \) (see the sketch below)
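A sketch of this setup, assuming PyTorch, a Gaussian encoder \( q(z \mid x) \), a Bernoulli decoder, and a single-sample Monte Carlo estimate of the expectation; the class name, layer sizes, and the closed-form Gaussian KL term (a common choice instead of sampling that term too) are illustrative:

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    """Sketch of a VAE: amortized Gaussian q(z|x), standard normal prior p(z), Bernoulli decoder p(x|z).
    The same encoder/decoder networks are shared across all data points (amortization)."""
    def __init__(self, x_dim=784, z_dim=20, h_dim=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        # q(z|x): diagonal Gaussian produced by the encoder network
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # single-sample Monte Carlo estimate of the expectation via the reparameterization trick
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # E_q[ln p(x|z)]: Bernoulli reconstruction term, estimated with the sampled z
        logits = self.dec(z)
        recon = -nn.functional.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        # E_q[ln p(z) - ln q(z|x)] = -KL(q || p), available in closed form for two Gaussians
        kl = 0.5 * (torch.exp(logvar) + mu**2 - 1.0 - logvar).sum(-1)
        return (recon - kl).mean()    # average per-datapoint ELBO L_n over the batch

# usage sketch: maximise the ELBO (i.e. minimise -ELBO) by gradient ascent
model = GaussianVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)               # stand-in batch; real data would go here
opt.zero_grad()
loss = -model.elbo(x)
loss.backward()
opt.step()
```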
Alternative view of EM
- the abstract goal is to find the maximum likelihood solution of a latent-variable model with parameters \( \theta \)

- assume we have the complete data set \( \{X, Z\} \); maximising \( \ln p(X, Z \mid \theta) \) is usually simple -> see Gaussian mixture, complete data
- if we do not have complete data, consider the expected value of the complete-data log likelihood under the posterior: \( Q(\theta, \theta^{\text{old}}) = \sum_Z p(Z \mid X, \theta^{\text{old}})\, \ln p(X, Z \mid \theta) \)


Mixtures of Bernoulli distributions
- \( p(x \mid \mu, \pi) = \sum_{k=1}^{K} \pi_k\, p(x \mid \mu_k) \) with \( p(x \mid \mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i} \)

- introduce a one-hot latent z

do the same EM steps again (see the worked updates below) ...
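Spelled out for this model (a sketch of the resulting updates, with the same structure as for the Gaussian mixture):
\[
\text{E step:}\quad \gamma(z_{nk}) = \frac{\pi_k\, p(x_n \mid \mu_k)}{\sum_j \pi_j\, p(x_n \mid \mu_j)}
\]
\[
\text{M step:}\quad N_k = \sum_n \gamma(z_{nk}), \qquad
\mu_k = \frac{1}{N_k} \sum_n \gamma(z_{nk})\, x_n, \qquad
\pi_k = \frac{N_k}{N}
\]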
Bayesian Linear Regression
- goal: solve the task arising in the evidence framework, the maximisation of \( p(\alpha, \beta \mid \mathcal D) \) assuming a flat prior, which means maximising the evidence \( p(\mathbf t \mid \alpha, \beta) \)
- consider \( \mathbf w \) as a latent variable
- E step: calculate the posterior distribution of \( \mathbf w \) given \( \alpha, \beta, \mathcal D \)
- the complete-data log likelihood is \( \ln p(\mathbf t, \mathbf w \mid \alpha, \beta) = \ln p(\mathbf t \mid \mathbf w, \beta) + \ln p(\mathbf w \mid \alpha) \)
- take the expectation of the complete-data log likelihood with respect to the posterior of the latent variable to get a lower bound on \( p(\mathbf t) \), and maximise this to get the parameters that maximise \( p(\alpha, \beta \mid \mathcal D) \) assuming a flat prior!

- next step: maximise with respect to the parameters \( \alpha, \beta \)
-> same stationary point as the evidence framework (simple calculation), but a different re-estimation equation :red_cross:
- -> analytically integrating out \( \mathbf w \) (to get the evidence) and subsequently setting the gradient to zero is equivalent to EM (same fixed points)
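For the contrast mentioned above, the two re-estimation equations for \( \alpha \), as recalled from the standard treatment (so take the exact form as an assumption; \( \mathbf m_N, \mathbf S_N \) are the posterior mean and covariance of \( \mathbf w \), \( \lambda_i \) the eigenvalues of \( \beta \boldsymbol\Phi^{\mathsf T} \boldsymbol\Phi \)):
\[
\text{evidence framework:}\quad \alpha^{\text{new}} = \frac{\gamma}{\mathbf m_N^{\mathsf T} \mathbf m_N}, \qquad \gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}
\]
\[
\text{EM:}\quad \alpha^{\text{new}} = \frac{M}{\mathbb E[\mathbf w^{\mathsf T} \mathbf w]} = \frac{M}{\mathbf m_N^{\mathsf T} \mathbf m_N + \operatorname{Tr}(\mathbf S_N)}
\]
Both iterations share the same fixed points but take different steps towards them.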
Approximate Inference
Task
- evaluate the posterior \( p(Z \mid X) \) or calculate expectations with respect to the posterior