COMP41011 Foundations of Machine Learning - Coggle Diagram

- - - - Handwritten digit recognition
      - Face detection and recognition
      - Age prediction
      - Stock market prediction
      - Clustering and segmentation of customers
      - Recommendation systems
  - - - Overtraining: learning the data instead of the patterns
      - Underfitting: model is too simple to capture the complexity of the data
    - - Variable y is discrete: classification
      - Variable y is continuous: regression
    - - We only have access to instances, not the labels, so we:
      - Find similar groups (clustering)
      - Find a probability function for X: density estimation
      - Find a lower dimensionality representation for X (dimensionality reduction and visualisation)
- - - - How will the solution be used?
      - What are the current solutions (if any)?
      - Is it a regression or classification problem?
      - How is performance measured?
      - What would be the minimum performance?
      - Are there comparable problems?
      - Is human expertise available?
    - - We use several metrics to assess the performance of a model
      - Metrics in regression
        
        Root mean squared error (RMSE)
        
        This is the preferred performance measure for regression tasks
        
        Mean absolute error (MAE)
        
        Preferred when the data contain many outliers (e.g. outlier districts), since it penalises large errors less heavily than RMSE
      - Metrics in classification
        
        Precision
        
        The ratio of correct positive predictions to the overall number of positive predictions
        
        = TP / (TP + FP)
        
        or the fraction of relevant documents in the documents returned
        
        Accuracy
        
        the ratio of examples correctly classified over the total number of examples classified
        
        = (TP + TN) / (TP + TN + FP + FN)
        
        useful when errors predicting all classes are equally important, but can be misleading in imbalanced classification problems.
        
        Recall
        
        The ratio of correct positive predictions to the overall number of positive examples in the dataset
        
        = TP / (TP + FN)
        
        or the fraction of relevant documents returned to the total relevant documents
        
        Confusion matrix
        
        TP - True Positive, FP - False Positive, TN - True Negative, FN - False Negative
        
        A table that summarises how successful the classification model is at predicting examples belonging to various classes
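
The metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming binary labels where 1 marks the positive class; the function names and the toy data are hypothetical and not part of the course material.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error for regression."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error for regression."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels (assumed: 1 = positive, 0 = negative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Toy regression example (made-up numbers): absolute errors are 1, 0, 1, 4.
reg_rmse = rmse([3, 5, 7, 9], [2, 5, 8, 13])   # sqrt(18 / 4)
reg_mae = mae([3, 5, 7, 9], [2, 5, 8, 13])     # 6 / 4

# Toy classification example (made-up labels).
labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(labels, predictions)
```

In the regression example the single large error of 4 dominates the RMSE (about 2.12) much more than the MAE (1.5), which is why MAE is the more robust choice when outliers are common.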
  - - - What data do you need and is it enough?
      - Legal obligations regarding the data
      - Removing sensitive information
      - Check the data size and type
      - Sample a test set and do not look at it
    - - Training
        
        Largest set
        
        Used to fit the model using the objective function
        
        For a large dataset use 70-95% for training
        
        For smaller dataset use as many training instances as possible and use cross-validation to make better use of this and the validation set
      - Validation
        
        Same size as test
        
        Used to choose the best predictive model among a set of candidates
        
        For a large dataset use 15-2.5% for validation
      - Test
        
        Same size as validation
        
        Used to assess the final performance of the model before shipping the model to production.
        
        For a large dataset use 15-2.5% for testing
    - - OR Leave-one-out cross validation
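
The splitting strategies above can be sketched as index generators. This is an illustrative sketch only: the function names, the 70/15/15 default and the fixed seed are assumptions. Leave-one-out cross-validation is treated as k-fold cross-validation with k equal to the number of instances.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the indices once and cut them into train / validation / test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_val = int(round(val_frac * n_samples))
    n_test = int(round(test_frac * n_samples))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; k = n_samples gives leave-one-out."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

train_idx, val_idx, test_idx = train_val_test_split(100)
loo_folds = list(k_fold_indices(10, k=10))  # leave-one-out on 10 instances
```

Every instance is used for validation exactly once across the folds, which is what makes cross-validation attractive when the dataset is small.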
  - - - Create a copy of the data for exploration or sample a fraction of a large dataset for exploration
      - Study each feature and its characteristics including
        
        name, noisiness, type, percentage of missing values, type of distribution, is the feature useful?
      - Identify the target attribute (for supervised learning)
      - Visualise the data
      - Study correlations between attributes
      - Identify what transformations you might be able to apply
      - Is there any additional data that you might find useful?
    - - The correlation coefficient between two RVs X and Y is given as ρ = σx,y / (σx σy)
      - which is between -1 and 1, where σx,y is the covariance between X and Y, and σx and σy are their standard deviations
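
As a small sketch, the correlation coefficient can be estimated from samples as the covariance divided by the product of the standard deviations. The function name is hypothetical; `np.corrcoef` is NumPy's built-in equivalent and is used here only as a cross-check.

```python
import numpy as np

def correlation(x, y):
    """Sample correlation: covariance divided by the product of standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    return float(cov_xy / (x.std() * y.std()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
perfect_positive = correlation(x, 2.0 * x + 1.0)   # exact linear relation, slope > 0
perfect_negative = correlation(x, -x)              # exact linear relation, slope < 0
noisy = correlation(x, np.array([2.0, 1.0, 4.0, 3.0, 6.0]))
```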
  - - - Remove outliers (optional)
      - Handle missing values
        
        Fill them in using the mean, median or other value
        
        Drop the feature if most of the instances have a missing value
        
        Drop an instance if you have several instances with missing values
        
        Add an additional feature describing whether the instance is missing the value
      - Most ML methods require features that are numbers rather than categories. When the values have no natural order they should not be mapped to ordered integers, so we use one-hot encoding to obtain a higher-dimensional binary representation for each value.
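
Two of the cleaning steps above, sketched with NumPy: mean imputation with an added "was missing" indicator feature, and one-hot encoding of an unordered categorical feature. The helper names and the toy values are illustrative assumptions.

```python
import numpy as np

def impute_mean_with_indicator(column):
    """Fill NaNs with the column mean and return a 0/1 'was missing' feature."""
    column = np.asarray(column, float)
    missing = np.isnan(column)
    filled = np.where(missing, np.nanmean(column), column)
    return filled, missing.astype(int)

def one_hot(values):
    """Map each category to a binary vector with a single 1."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    return categories, encoded

filled, was_missing = impute_mean_with_indicator([1.0, np.nan, 3.0])
categories, encoded = one_hot(["red", "green", "blue", "green"])
```

Each row of `encoded` contains exactly one 1, so no artificial ordering between "red", "green" and "blue" is introduced.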
    - - Remove features that are uninformative
      - Discretize continuous features (binning)
      - Create new features from available ones
    - - Several ML methods do not perform well when input features have very different scales, so we try to get all features to the same scale
      - Normalisation (min-max scaling)
        
        Map the range of values that a feature takes to the range [-1, 1] or [0, 1]
      - Standardisation (z-score normalisation)
        
        The features are scaled so that they have mean 0 and standard deviation 1.
      - Which scaling to use?
        
        If the dataset is small and there is time, try both and see which performs better. Else:
        
        Unsupervised learning algorithms usually benefit more from standardisation
        
        Standardisation also works well when the feature already has a distribution close to Gaussian
        
        If a feature has outliers, standardisation is also preferred
        
        Normalisation is preferred in all other cases
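
Both scalings can be sketched column-wise in NumPy. This is a minimal sketch with hypothetical function names; it assumes no feature is constant (otherwise the denominators are zero), and in practice the statistics would be computed on the training set only and reused for validation and test data.

```python
import numpy as np

def min_max_scale(X, low=0.0, high=1.0):
    """Normalisation: map each feature (column) to the range [low, high]."""
    X = np.asarray(X, float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return low + (X - x_min) * (high - low) / (x_max - x_min)

def standardise(X):
    """Standardisation: give each feature (column) mean 0 and standard deviation 1."""
    X = np.asarray(X, float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two features on very different scales (made-up values).
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_norm = min_max_scale(X)
X_std = standardise(X)
```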
  - - - Use random or grid search to explore hyperparameters
      - Consider the data transformations
      - Use Auto-ML to fine-tune
      - Once you are confident about your final model, merge the training and validation sets and fit the model again
      - Finally use the test data to assess the generalisation error of your ML system (do not change your model as you might overfit to the test data)
- - - - A discrete RV X is completely defined by a set of values it can take x1, x2, ..., xn and their corresponding probabilities
      - The probability that X = xi is denoted as P(X = xi) for i = 1, ..., n and is known as the probability mass function (pmf)
    - - In machine learning, we are usually interested in more than one random variable and they can be fully described with a joint probability mass function P(X=xi, Y=yj)
    - - Marginal
        
        We obtain the probability P(X = xi) regardless of the value of Y by summing the joint pmf over all values of Y: P(X = xi) = Σ_j P(X = xi, Y = yj). This is known as the sum rule of probability
      - Conditional
        
        The conditional pmf is P(X = xi | Y = yj) = P(X = xi, Y = yj) / P(Y = yj). From this we can also write P(X = xi, Y = yj) = P(X = xi | Y = yj)P(Y = yj). This is known as the product rule of probability
    - - Two discrete RVs are statistically independent if P(X = xi , Y = yj) = P(X = xi)P( Y = yj), i = 1, . . . , n j = 1, . . . , m
    - - The most frequently used expected values (statistical moments) of a discrete RV X are the mean and the variance.
      - The square root of the variance is known as the standard deviation.
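
The rules for discrete RVs can be checked numerically on a small joint pmf table. The table below is made up for illustration (rows index the values of X, columns the values of Y); it is not taken from the course.

```python
import numpy as np

# Hypothetical joint pmf P(X = x_i, Y = y_j): rows are x_i, columns are y_j.
x_values = np.array([0.0, 1.0, 2.0])
joint = np.array([[0.10, 0.20],
                  [0.30, 0.10],
                  [0.20, 0.10]])

# Sum rule: marginalise out the other variable.
p_x = joint.sum(axis=1)          # P(X = x_i) = [0.3, 0.4, 0.3]
p_y = joint.sum(axis=0)          # P(Y = y_j) = [0.6, 0.4]

# Conditional pmf: column j holds P(X = x_i | Y = y_j).
p_x_given_y = joint / p_y

# Product rule: P(X, Y) = P(X | Y) P(Y) recovers the joint table.
reconstructed = p_x_given_y * p_y

# Independence would require the joint to equal the outer product of the marginals.
independent = bool(np.allclose(joint, np.outer(p_x, p_y)))

# Mean, variance and standard deviation of X.
mean_x = float(np.sum(x_values * p_x))
var_x = float(np.sum((x_values - mean_x) ** 2 * p_x))
std_x = var_x ** 0.5
```

For this table X and Y are not independent: for example P(X = 0)P(Y = y_1) = 0.3 × 0.6 = 0.18, whereas the joint entry is 0.10.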
  - - - A continuous RV X takes values within one or more intervals of the real line
      - We use pdfs, pX(x), to describe a continuous RV X.
    - - Additionally we are often interested in more than one continuous RV, so we use a joint pdf, pX,Y(x, y), to fully characterise two continuous random variables.
    - - Sum rule of probability
        
        pX(x) = ∫ pX,Y(x, y) dy, where pX(x) is known as the marginal pdf
      - Product rule of probability
        
        pX|Y(x|y) = pX,Y(x, y) / pY(y), which can also be written as pX,Y(x, y) = pX|Y(x|y)pY(y)
    - - We say that two continuous RVs are statistically independent if pX,Y(x, y)= pX(x)pY(y)
    - - For continuous RVs, expected values are computed as E{g(X)} = ∫ g(x) pX(x) dx
- - - - if so then
  - - - The transpose of the product Xw is (Xw)^T = w^TX^T
  - - - A square matrix with 1s on the main diagonal (where i = j) and 0s elsewhere
  - - - We find the update equation for each of the parameters by differentiating the error function and rearranging to find the specific parameter, and finding the second derivative to ensure that the point is a minimum
      - We are trying to minimise the error function, by updating the parameters to their optimal values, moving one at a time.
      - We start by using the maximum likelihood update to find an estimate for the first parameter
      - The update equations are alternated in a loop until they give a good fit and meet the stopping criterion, in this case, the difference between the previous error and the current error is < 1e-4
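
The alternating updates can be sketched for a straight-line model. This assumes the model y = m x + c with a sum-of-squares error, which the notes do not state explicitly at this point; the function name and toy data are hypothetical. Each update is the stationary point of the error with respect to one parameter while the other is held fixed.

```python
import numpy as np

def fit_line_coordinate_descent(x, y, tol=1e-4, max_iters=10_000):
    """Alternate the closed-form updates for c and m until the error stops changing."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m, c = 0.0, 0.0
    prev_error = np.sum((y - m * x - c) ** 2)
    for _ in range(max_iters):
        # dE/dc = 0  =>  c = mean(y - m x); the second derivative 2n > 0, so a minimum.
        c = np.mean(y - m * x)
        # dE/dm = 0  =>  m = sum((y - c) x) / sum(x^2); second derivative 2 sum(x^2) > 0.
        m = np.sum((y - c) * x) / np.sum(x ** 2)
        error = np.sum((y - m * x - c) ** 2)
        # Stopping criterion from the notes: change in error below 1e-4.
        if abs(prev_error - error) < tol:
            break
        prev_error = error
    return m, c, error

# Noise-free toy data generated from m = 3, c = 1.
x = np.linspace(0.0, 2.0, 21)
y = 3.0 * x + 1.0
m, c, error = fit_line_coordinate_descent(x, y)
```

Because the two parameters are correlated through the data, each sweep only moves part of the way to the optimum, so several iterations are needed even on noise-free data.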
  - - - and
      - Any sum of squares can be represented by an inner product b^Tb, so if we define all values in a vector y and every function f(x_i;w) in a vector f(X;w) or just f, the objective function can be written as
      - Aside: f(X;w) is a vector-valued function whose elements are the function values f(x_i;w)
      - If we incorporate the design matrix X, we get f = Xw, and with this result we have defined the model with the two equations, the one to tell us the form of the predictive function and the other, the form of the objective function
    - - The model contains a function that shows how it will be used for prediction and a function that describes the objective function we need to optimise to obtain a good set of parameters
      - We find the minima of the objective function using multivariate calculus to find the stationary points.
      - We define vectorial derivatives as dE/dw = [∂E/∂w_1, ∂E/∂w_2]^T, where each element is the partial derivative of the error function with respect to w_1 or w_2
      - Matrix differentiation
        
        Differentiation of an inner product
        
        Differentiation of a 'matrix quadratic'
        
        A scalar quadratic in z with coefficient c is cz^2. If z(kx1) and C(kxk), then cz^2 becomes z^TCz, which is a scalar but a function of a vector
      - Matching dimensions in matrix multiplications
        
        AB where A(kxp) and B(qxr) only works when p = q and the dimensions of AB would be kxr.
        
        Matrix multiplication is not commutative in general: AB = BA holds only in special cases (for example, when both matrices are diagonal). Being square and symmetric is not sufficient.
      - Differentiating the objective
    - - We seek stationary points by finding the vectors that solve for when the gradients are 0.
      - Solving the multivariate system
        
        The solution for w is given in terms of a matrix inverse, w = (X^TX)^-1 X^Ty, but computation of a matrix inverse requires an algorithm. It is better to ask the computer to solve the system of equations given by X^TXw = X^Ty for w using numpy.linalg.solve
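
A minimal sketch of solving the least-squares system with `numpy.linalg.solve` instead of forming the inverse. The synthetic data, the seed and the bias column in the design matrix are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)

# Design matrix with a column of ones for the offset (assumed layout).
X = np.column_stack([np.ones_like(x), x])
w_true = np.array([1.0, 2.0])
y = X @ w_true  # noise-free targets, so the exact weights are recoverable

# Solve X^T X w = X^T y for w without computing the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)

# At the solution the gradient of the sum-of-squares objective is zero.
gradient = 2.0 * X.T @ (X @ w - y)
```

Solving the linear system directly is both cheaper and numerically more stable than computing the inverse of X^TX and multiplying by it.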
  - - - When we refer to non-linear regression, we are normally referring to whether the regression is non-linear in the input space or covariates, i.e. the observations that move with the target variable, e.g. the vector of covariates x_i associated with the ith observation corresponding to the target y_i
      - If a model is non-linear in the inputs, it means that there is a non-linear function between input and target space
      - An easy way to make the linear regression non-linear is to introduce non-linear functions, known as basis functions
    - - Although this example is a non-linear regression, it is still a linear model because it is linear in the parameters, f(x) = w^TΦ(x)
      - In practice basis functions may contain their own set of parameters, f(x) = w^TΦ(x;θ)
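
As an illustration of a model that is non-linear in the inputs but linear in the parameters, the sketch below uses a polynomial basis. The choice of basis, the function name and the toy data are assumptions; the notes do not say which basis the course example uses.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Phi(x) = [1, x, x^2, ..., x^degree] for each input, stacked as rows."""
    x = np.asarray(x, float)
    return np.column_stack([x ** d for d in range(degree + 1)])

# Noise-free quadratic data: non-linear in x.
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 - 2.0 * x + 0.5 * x ** 2

# f(x) = w^T Phi(x) is still linear in w, so the same linear solve applies.
Phi = polynomial_basis(x, degree=2)
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
predictions = Phi @ w
```

Only the design matrix changes compared with ordinary linear regression: X is replaced by Phi, and the solution for w has exactly the same form.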
- - - - E{y} = E{constant + ϵ} = constant
      - var{y} = var{constant} + var{ϵ} = σ^2
      - This means that , and the given constant means
      - Because we assumed and x and w are given, we can also write
- - - - Line search methods (may use search directions other than the steepest descent direction)
      - Conjugate gradient (method of choice for quadratic objectives g(w) = w^TAw)
      - Use a Newton search direction
  - - - Traditionally in machine learning, the gradient g_k is computed using the whole dataset
      - There are settings where only a subset of the data can be used
        
        Online learning: the instances (x_n, y_n) appear one at a time
        
        Large datasets: computing the exact value of g_k would be expensive, if not impossible
    - - The gradient g_k is computed using a subset of the instances available
      - The word stochastic refers to the fact that the value for g_k will depend on the subset of the instances chosen for computation (and it is important to randomly sample)
      - In the stochastic setting, a better estimate can be found if the gradient is computed using g_k = (1/|S|) Σ_{i∈S} g_(k, i), where |S| is the cardinality (number of samples in the set) of S, and g_(k, i) is the gradient at iteration k computed using the instance (x_i, y_i)
      - This setting is called mini-batch gradient descent
    - - Choosing the value of η is particularly important in SGD since there is no easy way to compute it
      - Usually the value of η will depend on the iteration k, η_k
      - It should follow the Robbins-Monro conditions
      - Various formulas for η_k can be used, e.g. η_k = (τ0 + k)^(-κ), where τ0 ≥ 0 slows down the early iterations and κ ∈ (0.5, 1]
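
A sketch of mini-batch gradient descent for linear regression with a decaying step size. The schedule η_k = (τ0 + k)^(-κ) is one common choice and is an assumption here, as are the values τ0 = 10 and κ = 0.6, the batch size of 32, the squared-error objective and the synthetic data. For κ in (0.5, 1] this schedule satisfies the Robbins-Monro conditions (the step sizes sum to infinity while their squares have a finite sum).

```python
import numpy as np

def step_size(k, tau0=10.0, kappa=0.6):
    """Assumed schedule: eta_k = (tau0 + k)^(-kappa)."""
    return (tau0 + k) ** (-kappa)

def minibatch_gradient(w, X, y, batch):
    """Average of the per-instance gradients g_(k, i) over the subset S = batch."""
    residual = X[batch] @ w - y[batch]
    return 2.0 * X[batch].T @ residual / len(batch)

def loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(2)
initial_loss = loss(w, X, y)
for k in range(500):
    # Random sampling of the subset is what makes the gradient stochastic.
    batch = rng.choice(len(y), size=32, replace=False)
    w = w - step_size(k) * minibatch_gradient(w, X, y, batch)
final_loss = loss(w, X, y)
```

When the subset S contains every instance, `minibatch_gradient` reduces to the usual full-batch gradient, so ordinary gradient descent is the special case |S| = N.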
- - - - For each link, we have bias and weight terms.
      - Superscript always refers to the layer number
      - z is the linear output and a is the activation output
      - Vector and matrix notation:
  - - - and given error E(a^L, y), we want
      - If there are n_L units in layer L, then the n_L x n_L Jacobian matrix is:
      - If h_L is applied element-wise, then this matrix is diagonal
    - - Compute the gradient with respect to z^l
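
The notation above can be made concrete with a small fully connected network. This sketch assumes z^l = W^l a^(l-1) + b^l, a^l = h(z^l) with a sigmoid h applied element-wise, and the error E = 0.5 ||a^L - y||^2; these choices, the layer sizes and the function names are illustrative assumptions. Because h is element-wise, its Jacobian is diagonal and multiplying by it reduces to an element-wise product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return all activations a^l (with a^0 = x) and pre-activations z^l."""
    activations, pre_activations = [x], []
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b
        pre_activations.append(z)
        activations.append(sigmoid(z))
    return activations, pre_activations

def backward(y, weights, activations):
    """Gradients of E = 0.5 * ||a^L - y||^2 with respect to every W^l and b^l."""
    grads_W, grads_b = [], []
    a_L = activations[-1]
    # delta = dE/dz^L: dE/da^L times the diagonal Jacobian, sigmoid'(z) = a (1 - a).
    delta = (a_L - y) * a_L * (1.0 - a_L)
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(delta, activations[l]))
        grads_b.insert(0, delta)
        if l > 0:
            a_prev = activations[l]
            # Propagate dE/dz one layer back through W^l and the activation.
            delta = (weights[l].T @ delta) * a_prev * (1.0 - a_prev)
    return grads_W, grads_b

def error(x, y, weights, biases):
    a_L = forward(x, weights, biases)[0][-1]
    return 0.5 * float(np.sum((a_L - y) ** 2))

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]
x, y = np.array([0.5, -1.0]), np.array([1.0])

activations, pre_activations = forward(x, weights, biases)
grads_W, grads_b = backward(y, weights, activations)
```

A finite-difference comparison against `error` is a convenient way to confirm that the backward pass is implemented correctly.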
  - - - As many matrix multiplications as there are fully connected layers, performed twice: once during the forward pass and once during the backward pass
    - - Need to store vectors a^l, z^l and for each layer
    - - Yes, if we minibatch, we perform tensor operations
      - Make sure that all parameters fit in GPU memory
- - - - For each nonlinear unit a_i = h(z_i), to normalise the pre-activations within a mini-batch of size K:
      - The mean and variance are computed across the mini-batch separately for each hidden unit
      - Commonly used in convolutional neural networks
    - - To normalise across the hidden-unit values for each data point separately
      - The mean and variance are computed across the hidden units separately for each data point
      - Commonly used in recurrent neural networks and transformers
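
The difference between the two normalisations is only the axis along which the statistics are computed. In this sketch the pre-activations are assumed to be stored as a K x M array (K data points in the mini-batch, M hidden units); the small constant `eps` for numerical stability and the omission of the learnable scale and shift parameters are simplifications.

```python
import numpy as np

def batch_norm(Z, eps=1e-5):
    """Normalise each hidden unit (column) across the mini-batch."""
    mean = Z.mean(axis=0, keepdims=True)
    var = Z.var(axis=0, keepdims=True)
    return (Z - mean) / np.sqrt(var + eps)

def layer_norm(Z, eps=1e-5):
    """Normalise each data point (row) across its hidden units."""
    mean = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
Z = 3.0 * rng.normal(size=(8, 4)) + 5.0   # K = 8 data points, M = 4 hidden units
Z_bn = batch_norm(Z)
Z_ln = layer_norm(Z)
```

Layer normalisation does not depend on the other items in the mini-batch, which is one reason it suits models where the batch size is small or sequence lengths vary.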
  - - - a) visualisation of the error surface for a network with 56 layers
      - b) the same network with the inclusion of residual connections
- - - - equal to 1, the gradient can be propagated
      - greater than 1, the product will grow exponentially (explode)
      - less than 1, the product shrinks exponentially (vanishes)
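
A quick numerical illustration of the three cases, assuming the "product" refers to a chain of identical factors multiplied together when a gradient is propagated through many layers or time steps. The depth of 50 and the factors 0.9 and 1.1 are arbitrary choices.

```python
import numpy as np

def product_of_factors(factor, depth):
    """Product of `depth` identical factors, as in a long chain-rule product."""
    return float(np.prod(np.full(depth, factor)))

depth = 50
vanishing = product_of_factors(0.9, depth)   # shrinks towards 0
stable = product_of_factors(1.0, depth)      # stays at 1
exploding = product_of_factors(1.1, depth)   # grows without bound as depth increases
```

With only 50 factors, 0.9 already gives a product of roughly 0.005 and 1.1 a product of roughly 117, so even small deviations from 1 matter in deep or recurrent networks.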
- - - - I swam across the river to get to the other bank
      - I walked across the road to get cash from the bank
  - - - Define a fixed dictionary and introduce vectors of length equal to dictionary size
      - To encode the k-th word with the vector x_n, having 1 in position k and 0 elsewhere
      - Results in vectors of very high dimensionality if the dictionary is large
    - - Define a matrix E of size D x K, where D is the dimensionality of the embedding space and K is the size of the dictionary. Then v_n = Ex_n, where the vector v_n is the column of E corresponding to the word (E can be learned, e.g. word2vec)
      - Learned embedding space often has an even richer semantic structure
      - Can be learned as a pre-processing step or as part of the overall end-to-end training
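
The relationship between one-hot vectors and embeddings can be verified directly: multiplying E by a one-hot vector simply selects a column of E. The dictionary size K = 5, the embedding dimension D = 3 and the random matrix standing in for a learned E are illustrative assumptions.

```python
import numpy as np

K, D = 5, 3                       # dictionary size, embedding dimensionality
rng = np.random.default_rng(0)
E = rng.normal(size=(D, K))       # stand-in for a learned embedding matrix

def one_hot_word(k, dictionary_size):
    """Vector with 1 in position k and 0 elsewhere."""
    x = np.zeros(dictionary_size)
    x[k] = 1.0
    return x

k = 2
x_n = one_hot_word(k, K)
v_n = E @ x_n                     # embedding of the k-th word
```

In practice the product is never formed explicitly: looking up column k of E gives the same vector without storing the K-dimensional one-hot input.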
    - - A set of vectors {x_n} of dimensionality D, where n = 1, ..., N
      - Vectors are tokens, which might correspond to a word or byte pair
    - - nth row comprises the token vector x^T_n
      - This matrix represents one set of input tokens
      - For most applications, we will require a data set containing many sets of tokens
  - - - Takes X as input and outputs a transformed matrix Y of the same dimensionality, Y = TransformerLayer[X]
      - Can apply multiple times in succession to construct deep networks