Machine Learning
Supervised Learning
Decision Tree
Goodness measure
Info gain
Info gain is biased towards choosing features with a large number of values (e.g., an ID column), which leads to overfitting
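A minimal sketch (hypothetical toy data) of how information gain is computed, and why an ID-like feature with one value per record gets the maximum possible gain:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, labels):
    """Entropy of the parent minus the weighted entropy of the children."""
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(feature):
        mask = feature == v
        weighted += mask.mean() * entropy(labels[mask])
    return total - weighted

# Toy data: the ID feature puts every record in its own child node,
# so the children are pure and it gets the maximum possible gain.
labels  = np.array(["yes", "no", "yes", "no", "yes", "no"])
outlook = np.array(["sun", "sun", "rain", "rain", "sun", "rain"])
ids     = np.arange(6)

print(info_gain(outlook, labels))  # modest gain
print(info_gain(ids, labels))      # maximal gain -> overfitting risk
```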
Greedy search, no backtracking
Support Vector Machine
architecture
always 3 layers - input, hidden and output
Penalty parameter C: the bigger C is, the less tolerant the model is of misclassifications
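A small, hedged sketch with scikit-learn's SVC on synthetic data, just to illustrate the effect of C (the values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, slightly noisy two-class data (illustrative only)
X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X_train, y_train)
    print(f"C={C:>6}: train={model.score(X_train, y_train):.2f} "
          f"test={model.score(X_test, y_test):.2f}")
```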
Linear Regression
- Fit a line to the data
- Gradient descent for each parameter e.g., m and c for y = mx + c
- Update m = m - learning rate * (dE/dm), i.e., move against the gradient
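A minimal gradient-descent sketch for y = mx + c with mean squared error; the toy data and learning rate are made up:

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.0, 4.9, 7.2, 9.1])

m, c = 0.0, 0.0
lr = 0.02

for _ in range(2000):
    pred = m * x + c
    error = pred - y
    dE_dm = 2 * np.mean(error * x)   # dE/dm for mean squared error
    dE_dc = 2 * np.mean(error)       # dE/dc
    m -= lr * dE_dm                  # move against the gradient
    c -= lr * dE_dc

print(m, c)  # should approach m ≈ 2, c ≈ 1
```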
K-nearest neighbours
Given a new point, look at the classes of its k nearest neighbours (majority vote)
Needs a distance metric, the training data and a value of k
If k is too small, the classifier is sensitive to noise; if k is too big, the neighbourhood may include irrelevant points. Choose an odd value of k to avoid ties
Advantages - simple, building the model is inexpensive, well suited for multi-class data or records with multiple class labels
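A from-scratch sketch of k-nearest neighbours with Euclidean distance and a majority vote (the toy points are hypothetical):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0], [4.1, 3.9]])
y_train = np.array(["A", "A", "B", "B", "B"])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> "A"
```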
Naive Bayes
When a feature count is zero (e.g., in the fruits example the count of "long" is zero for one class), the whole probability product becomes zero
Hence the Laplace correction: add 1 to each count so that the overall probability never becomes zero
Pros
When the assumption of independent variables holds, naive Bayes performs better than logistic regression
Does well for categorical features; for numerical features, need to assume a normal distribution
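A minimal sketch of the Laplace (add-one) correction; the fruit counts are hypothetical, mirroring the "long = 0" case above:

```python
# Hypothetical counts of the feature "long = yes" per fruit class
counts = {"banana": 400, "orange": 0, "other": 100}   # orange has zero "long" fruit
totals = {"banana": 500, "orange": 300, "other": 200}

def p_long_given_class(cls, laplace=True):
    if laplace:
        # add 1 to the count and 2 to the total (2 possible values: long / not long)
        return (counts[cls] + 1) / (totals[cls] + 2)
    return counts[cls] / totals[cls]

print(p_long_given_class("orange", laplace=False))  # 0.0 -> whole product collapses
print(p_long_given_class("orange", laplace=True))   # small but non-zero
```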
Unsupervised Learning
Evaluate cluster results
Interpreting
Clusters can be explained in practical terms by the distinguishing features across cluster profiles
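One hedged way to evaluate and interpret clusters, assuming k-means and the silhouette score (the map does not prescribe a particular algorithm; the data is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print("silhouette:", silhouette_score(X, labels))  # closer to 1 = better separated

# Interpreting: look at the distinguishing feature values of each cluster profile
for c in range(3):
    print(f"cluster {c}: mean feature values = {X[labels == c].mean(axis=0)}")
```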
Machine Learning process
Preprocessing
Data Cleaning
Fill in missing values, smooth noisy data, remove outliers, resolve inconsistencies
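A small pandas sketch of these cleaning steps; the column names, fill strategy and outlier rule are assumptions:

```python
import pandas as pd

# Hypothetical raw data with a missing value, noise, and an outlier
df = pd.DataFrame({
    "age":    [25, 32, None, 41, 29, 30],
    "income": [50_000, 52_000, 51_000, 1_000_000, 49_000, 53_000],
})

# Fill in missing values (here: with the column median)
df["age"] = df["age"].fillna(df["age"].median())

# Smooth noisy data (here: a simple rolling mean)
df["income_smooth"] = df["income"].rolling(window=3, min_periods=1).mean()

# Remove outliers (here: values outside 1.5 * IQR)
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["income"] >= q1 - 1.5 * iqr) & (df["income"] <= q3 + 1.5 * iqr)]

print(df)
```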
Data reduction
Dimensionality reduction
General methods
Aggregation
Ratios, e.g., income-to-debt ratio
Feature subset selection
filter approach
e.g., select attributes that have as low correlation with each other as possible
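A hedged sketch of a correlation-based filter that drops one of each highly correlated pair of features (the threshold and column names are assumptions):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Filter approach: keep features whose pairwise correlation stays below threshold."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Hypothetical data: height_cm and height_in are almost perfectly correlated
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59, 63, 67, 71, 75],
    "weight":    [70, 55, 90, 60, 80],
})
print(drop_highly_correlated(df).columns.tolist())  # one of the height columns is dropped
```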
wrapper approach
backward feature elimination
start with all features, remove one at a time
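A hedged sketch of backward elimination as a wrapper: start with all features and repeatedly drop the one whose removal hurts cross-validated accuracy the least (the model and scoring choices are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
features = list(range(X.shape[1]))          # start with all features

def cv_score(cols):
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, cols], y, cv=5).mean()

# Wrapper: remove one feature at a time while it does not hurt performance
while len(features) > 1:
    best_subset, best_score = None, cv_score(features)
    for f in features:
        subset = [c for c in features if c != f]
        score = cv_score(subset)
        if score >= best_score:             # dropping f is at least as good
            best_subset, best_score = subset, score
    if best_subset is None:                 # every removal hurts -> stop
        break
    features = best_subset

print("selected feature indices:", features)
```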
Model evaluation
Overfitting
When a model has high training accuracy but low testing accuracy
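A quick sketch of spotting overfitting by comparing training and testing accuracy; the unconstrained decision tree here is just an easy way to provoke it:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no depth limit
print("train accuracy:", tree.score(X_train, y_train))  # ~1.0
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower -> overfitting
```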
Ensemble
Method 2: Random Forest
Need to tune the number of candidate features considered at each split; a common suggestion is sqrt(total no. of features)
Testing random forest
No need for a separate test set; just test each tree on its left-over (out-of-bag) data
Each sample is out-of-bag for roughly a third of the trees (assuming about two thirds of the data are used to train each tree). That sample, together with the other out-of-bag samples for a tree, is used to get that tree's predictions for evaluation
For regression, sum up (prediction - actual)^2
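A hedged sketch with scikit-learn's RandomForestClassifier: max_features="sqrt" follows the sqrt suggestion above, and oob_score=True evaluates each sample only with the trees that did not train on it, so no separate test set is needed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=25, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # consider sqrt(total no. of features) at each split
    oob_score=True,        # out-of-bag evaluation on left-over samples
    random_state=0,
).fit(X, y)

print("out-of-bag accuracy:", forest.oob_score_)
```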
Method 3: Boosting
- Build a model (but don't overfit)
- Increase weights of examples the model got wrong ("Look at what you got wrong. Look! Look!")
- Retrain a new model using the weighted training set
- Repeat (e.g., 100+ iterations)
- Weighted prediction of each model
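A minimal from-scratch sketch of these steps for binary labels in {-1, +1}, using decision stumps as the weak models (dataset and number of rounds are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
y = np.where(y == 1, 1, -1)                 # labels in {-1, +1}

n_rounds = 50
weights = np.full(len(y), 1 / len(y))       # start with uniform weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)          # weak model: don't overfit
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)                # model's vote strength
    weights *= np.exp(alpha * (pred != y) - alpha * (pred == y))  # up-weight mistakes
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Weighted prediction of each model
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(scores) == y).mean())
```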
Gradient Boosting
Different from AdaBoost: AdaBoost up-weights the points the previous model got wrong, whereas Gradient Boosting fits each new model to the residuals (gradient) of a user-defined loss function
- Predict using a first decision tree stump (or a simple initial guess such as the mean)
- Compute the residuals (actual - prediction); these residuals become the target for the next tree
- Fit the new tree on the same input variables with the residuals as the new target
- Update the prediction (original prediction + the new tree's prediction of the residuals) and compute the residuals again
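A minimal from-scratch sketch of these steps for regression with squared error, where the residuals are actual - prediction (learning rate and tree depth are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

prediction = np.full_like(y, y.mean())      # initial prediction
learning_rate = 0.1
trees = []

for _ in range(100):
    residuals = y - prediction                          # actual - prediction
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                              # residuals are the new target
    prediction += learning_rate * tree.predict(X)       # update the running prediction
    trees.append(tree)

print("mean squared error:", np.mean((prediction - y) ** 2))
```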
Hybrid models
Difference with ensemble
Ensemble combines multiple homogeneous, weak models
Hybrid combines different types of models