ANALYTICS AND MACHINE LEARNING - Coggle Diagram

- - - - A model consists of features and a target variable. For example, tomorrow's temperature is the target variable, and wind speed and cloud movement are the features.
    - - The most important part of modelling. The best models represent reality. Consulting an expert or the literature is the better approach. Cons: availability bias. Limited data would bias the training of the model.
  - - - Once a model is trained, it is time to see whether it can be used in reality. A good model has two properties: good predictive power and good generalisation to unseen data. To check this, we have to define an error measure and a validation strategy.
        
        Error measure: classification error rate for classification problems and mean squared error for regression problems.
        
        Validation strategy: K-fold cross-validation: the data is divided into k parts (folds). One fold is used as testing data while the other k-1 folds are used for training, and this is repeated so that every fold is used for testing once. In this way all the data is used.
        Leave-one-out: only one data point is used for testing and everything else is used for training. Works on smaller datasets.
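
The two validation strategies above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the toy "always predict the training mean" model and the temperature values are all made up for the example, and the data is split in order without shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def mean_squared_error(y_true, y_pred):
    """Error measure for regression problems."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate_mean_predictor(y, k):
    """Average test MSE of a toy model that always predicts the training mean."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        y_test = [y[i] for i in test_idx]
        fold_errors.append(mean_squared_error(y_test, [prediction] * len(y_test)))
    return sum(fold_errors) / len(fold_errors)


temperatures = [21.0, 23.5, 19.0, 22.0, 24.5, 20.0]  # hypothetical target values

three_fold_error = cross_validate_mean_predictor(temperatures, k=3)
# Leave-one-out is the special case where k equals the number of samples.
leave_one_out_error = cross_validate_mean_predictor(temperatures, k=len(temperatures))
```

Every sample lands in exactly one test fold, which is what "all the data is used" means here. In practice the data is usually shuffled before splitting.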
- - - - Regression: a predictive analytics technique that studies the relationship between a dependent variable and one or more independent variables
        
        Types: Linear, Logistic, Polynomial, Stepwise, Ridge, Lasso, Elastic Net
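
As a rough illustration of the simplest type, linear regression with one independent variable can be fitted by ordinary least squares. The function name and the wind speed and temperature numbers below are hypothetical, chosen so that the fitted line is easy to check by hand.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) minimising the squared error of y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


wind_speed = [1.0, 2.0, 3.0, 4.0]      # independent variable (made-up values)
temperature = [5.0, 7.0, 9.0, 11.0]    # dependent variable, exactly 2 * x + 3

slope, intercept = fit_simple_linear_regression(wind_speed, temperature)
predicted = slope * 5.0 + intercept     # prediction for an unseen wind speed
```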
    - - Terminology
        Classifier: an algorithm that maps input variables to a class
        Classification model: the output of a classifier's training; it predicts the class of a given input
        Feature: an individual measurable property of the data, like one of the columns in a dataset
        Binary classification: classification where the output has two possible classes, e.g. male or female
      - Types of classification algorithm:
        Linear Classifiers
        Support Vector Machines
        Kernel Estimation
        Quadratic Classifiers
        Decision Trees
        Neural Networks
        
        Support Vector Machine: a supervised machine learning model that uses a classification algorithm for two-group classification problems. It offers higher speed and better performance when only a limited number of samples is available.
        
        Key Terms
        
        Hyperplane: the decision boundary that best separates data points of different classes. In 2D it is a line, in 3D it is a plane, and in higher dimensions it is a hyperplane.
        Margin: the distance between the hyperplane and the nearest data point.
        Support vectors: the points closest to the hyperplane, which define the margin.
        E.g. apples and oranges, separated on texture
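
The key terms above can be made concrete with a small sketch. It assumes the hyperplane is written as w·x + b = 0, and it takes a hand-picked hyperplane and made-up 2D points rather than training an SVM; all names are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_and_support_vectors(w, b, points, tol=1e-9):
    """Margin of a given hyperplane and the points that lie exactly on it."""
    distances = [abs(signed_distance(w, b, p)) for p in points]
    margin = min(distances)
    support = [p for p, d in zip(points, distances) if d - margin <= tol]
    return margin, support


# Hypothetical 2D data: two "apples" on the left, two "oranges" on the right.
points = [(1.0, 1.0), (2.0, 3.0), (4.0, 2.0), (6.0, 1.0)]

# Hand-picked hyperplane x1 = 3, i.e. w = (1, 0) and b = -3. In 2D this is a line.
w, b = (1.0, 0.0), -3.0

margin, support_vectors = margin_and_support_vectors(w, b, points)
sides = [signed_distance(w, b, p) > 0 for p in points]  # which side each point falls on
```

An SVM would search for the w and b that make this margin as large as possible; here the hyperplane is simply given.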
        
        Linear SVM: used when the data is linearly separable.
        
        Non-linear SVM: used when the data is not linearly separable and has to be projected into a higher-dimensional space. The kernel trick is used to do this.
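
A small sketch of why the kernel trick works, using one well-known case: for 2D inputs, the kernel (x·y)² gives the same value as an ordinary dot product after mapping each point into 3D with (x1², √2·x1·x2, x2²). The function names and sample points are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def feature_map(x):
    """Explicit projection of a 2D point into a 3D feature space."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def quadratic_kernel(x, y):
    """Same inner product, computed without ever building the 3D vectors."""
    return dot(x, y) ** 2


x = (1.0, 2.0)
y = (3.0, 4.0)

explicit = dot(feature_map(x), feature_map(y))  # project first, then take the dot product
implicit = quadratic_kernel(x, y)               # kernel trick: stay in 2D
```

Both values agree up to floating-point rounding, so the higher-dimensional projection never has to be computed explicitly.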
        
        Linear kernel: used when the data is linearly separable
        
        Polynomial kernel: used when the data has a polynomial relationship
        
        Radial basis function (RBF) kernel: used when the data is not linearly separable and needs to be projected to higher dimensions
        
        Sigmoid kernel: based on the tanh function, which is also used as an activation function in neural networks
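
The four kernels listed above can be written down in their commonly used forms. This is a sketch only: the parameter names `gamma`, `coef0` and `degree` and their default values are assumptions made for the example, not something the notes specify.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    # Depends only on the squared Euclidean distance between the two points.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


x = (1.0, 2.0)
y = (3.0, 4.0)

kernel_values = {
    "linear": linear_kernel(x, y),
    "polynomial": polynomial_kernel(x, y, degree=2),
    "rbf": rbf_kernel(x, y, gamma=0.1),
    "sigmoid": sigmoid_kernel(x, y, gamma=0.01),
}
```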
        
        Hard-margin SVM: used when the data is linearly separable with no overlap
        Soft-margin SVM: allows some misclassification for better generalisation
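
One common way to express the soft margin is the hinge-loss objective 0.5·‖w‖² + C·Σ max(0, 1 − y·(w·x + b)), with labels y of +1 or −1. The sketch below only evaluates that objective for a hand-picked hyperplane; the penalty parameter `C`, the data and the function name are assumptions for illustration, and no optimisation is performed.

```python
def soft_margin_objective(w, b, points, labels, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses; labels must be +1 or -1."""
    regulariser = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Zero when the point is on the correct side and outside the margin.
        hinge_total += max(0.0, 1.0 - y * score)
    return regulariser + C * hinge_total


w, b = (1.0, 0.0), -3.0  # hand-picked hyperplane x1 = 3

separable_points = [(1.0, 1.0), (2.0, 3.0), (4.0, 2.0), (6.0, 1.0)]
separable_labels = [-1, -1, 1, 1]

# Add one overlapping point: labelled -1 but lying on the +1 side.
overlapping_points = separable_points + [(3.5, 0.0)]
overlapping_labels = separable_labels + [-1]

clean_cost = soft_margin_objective(w, b, separable_points, separable_labels)
lenient_cost = soft_margin_objective(w, b, overlapping_points, overlapping_labels, C=1.0)
strict_cost = soft_margin_objective(w, b, overlapping_points, overlapping_labels, C=10.0)
```

A larger `C` punishes the misclassified point more heavily, pushing the behaviour towards a hard margin, which tolerates no violations at all.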
  - - - Cluster: a subset of objects that are similar to one another. The distance between similar objects is smaller than the distance between dissimilar objects.
      - Clustering: the process of partitioning a set of data into meaningful subclasses, called clusters. A good clustering has high intra-cluster similarity and low inter-cluster similarity.
      - Categories: Partitioning, Hierarchical, Density-based, Model-based and Grid-based
        
        Partitioning: a dataset D of n objects is divided into k non-overlapping clusters.
        Advantages:
        Simple and understandable
        Items are automatically assigned to clusters
        Disadvantages:
        The number of clusters must be determined beforehand
        All items are forced into a cluster
        Very sensitive to outliers
        Characteristics:
        Finds clusters that are roughly spherical in shape
        Distance-based
        Takes the mean or median of each cluster as its centre point
        Effective for small to medium-sized datasets
        Disadvantage:
        K-means clustering is sensitive to outliers, so k-medoids is used instead: the medoids are actual data points that act as the centre points for the other data points.
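
The outlier sensitivity mentioned above can be seen by comparing a cluster's mean with its medoid. The sketch below also includes a bare-bones k-means loop; the function names, the fixed iteration count, the caller-supplied starting centres and all data points are assumptions made for the example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """The actual data point with the smallest total distance to all the others."""
    return min(points, key=lambda c: sum(euclidean(c, p) for p in points))


def k_means(points, centres, iterations=10):
    """Plain k-means starting from caller-supplied initial centres."""
    centres = list(centres)
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        # Assignment step: every point is forced into its nearest cluster.
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: euclidean(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster (keep it if empty).
        centres = [mean_point(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters


# One tight group of points plus a single far-away outlier.
cluster_with_outlier = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (50.0, 50.0)]
centre_by_mean = mean_point(cluster_with_outlier)   # dragged towards the outlier
centre_by_medoid = medoid(cluster_with_outlier)     # stays inside the tight group

# Two well-separated groups, k = 2.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
final_centres, final_clusters = k_means(data, centres=[(0.0, 0.0), (10.0, 10.0)])
```

The mean ends up far from every real point in the group, while the medoid is one of the original points, which is why k-medoids copes better with outliers.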