ANALYTICS AND MACHINE LEARNING
KEY CONCEPTS
Algorithm
Data
Model
Training
Testing
Features
Labels
APPLICATION OF ML
Image and Speech Recognition
Natural Language Processing
Recommendation Systems
Predictive Analytics
Autonomous Systems
CHALLENGES IN ML
Data Quality and Quantity
Bias and Fairness
Overfitting and Underfitting
Computational Resources
MODELING PROCESS
Feature Engineering and Model Selection
Model Selection
A model consists of features and a target variable, e.g. tomorrow's temperature is the target variable, and wind speed and cloud movement are the features
Feature Engineering
The most important part of modeling. The best models are those that represent reality. Consulting an expert or the literature helps when choosing features. Cons: availability bias. Limited data can bias the training of the model
Training the model
Once the features are selected, the model is trained. With the right predictors and modeling technique in place, training can proceed
Model Validation and Selection
Model Validation
Once a model is trained, it is time to see whether it can be used in reality. A good model has two properties: good predictive power and good generalisation to unseen data. For this we have to define an error measure and a validation strategy
Error measure: classification error rate for classification problems and mean squared error for regression problems
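The two error measures named above can be sketched in plain Python. This is a minimal illustration, not a reference implementation; the function names classification_error and mean_squared_error are chosen here for the example and do not come from the notes.

```python
def classification_error(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative data only (not from the notes).
print(classification_error(["a", "b", "a", "b"], ["a", "b", "b", "b"]))
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Lower is better for both measures; which one applies depends on whether the target variable is a class or a number.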
Validation strategy: K-fold cross-validation. The data is divided into k parts (folds). One fold is used as testing data while the other k-1 folds are used for training; this is repeated so that each fold serves once as the test set. In this way all the data is used
Leave-one-out: only one data point is used for testing and everything else is used for training; this is repeated for every data point. Works on smaller datasets.
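The splitting logic behind both validation strategies can be sketched as index generation. This is a simplified illustration: the helper names k_fold_splits and leave_one_out_splits are hypothetical, and the sketch does not shuffle the data, which a practical setup usually would.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def leave_one_out_splits(n_samples):
    """Leave-one-out is k-fold with k equal to the number of samples."""
    return k_fold_splits(n_samples, n_samples)


for train_idx, test_idx in k_fold_splits(6, 3):
    print("train:", train_idx, "test:", test_idx)
```

Every index appears in exactly one test fold, which is what lets all of the data contribute to the validation estimate.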
TYPES OF ML
Supervised ML: the training data contains both inputs and their labeled outputs. The model learns patterns that map inputs to outputs
Regression, Classification
Regression: a predictive analytics technique that studies the relationship between a dependent variable and one or more independent variables
Types: Linear, Logistic, Polynomial, Stepwise, Ridge, Lasso, ElasticNet
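As an illustration of the simplest type in this list, linear regression with a single independent variable can be fitted by ordinary least squares. The sketch below assumes one feature only, and the function name fit_simple_linear_regression and the sample numbers are made up for the example.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Spread of x and co-variation of x with y, both around their means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on the line y = 2x + 1.
slope, intercept = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

Here x plays the role of the independent variable (feature) and y the dependent variable (target).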
Classification
Terminology: Classifier: an algorithm that maps input variables to a class
Classification model: the output of training a classifier. Predicts the class of a given input
Feature: a measurable property of the data, like one of the columns in a dataset
Binary classification: the output has two possible classes, e.g. male or female
Types of classification algorithms: Linear Classifiers
Support Vector Machines
Kernel Estimation
Quadratic Classifiers
Decision Trees
Neural Networks
Support Vector Machine: a supervised machine learning model that uses a classification algorithm for two-group classification. It offers good speed and performance when the number of samples is limited
Key Terms
Hyperplane: the decision boundary that best separates data points of different classes. In 2D it is a line,
in 3D a plane,
in higher dimensions a hyperplane
Margin: distance between the hyperplane and the nearest data point.
Support Vectors: the points closest to the hyperplane, which define the margin
E.g. separating apples and oranges based on texture
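The key terms above can be made concrete for a hyperplane that is already given. The sketch below does not search for the best hyperplane (that is what SVM training does); it only measures the margin and picks out the support vectors for an assumed hyperplane w.x + b = 0. The function names and the sample points are illustrative.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("w must not be the zero vector")
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_and_support_vectors(w, b, points, tol=1e-9):
    """Return the smallest distance to the hyperplane and the points attaining it."""
    distances = [abs(signed_distance(w, b, p)) for p in points]
    margin = min(distances)
    support = [p for p, d in zip(points, distances) if d - margin <= tol]
    return margin, support


# Assumed hyperplane: the vertical line x1 = 2 in a 2D feature space.
w, b = [1.0, 0.0], -2.0
points = [(0.0, 0.0), (1.0, 5.0), (3.0, 1.0), (6.0, 2.0)]
print(margin_and_support_vectors(w, b, points))
```

The sign of signed_distance tells which side of the hyperplane a point falls on, which is how a trained linear SVM assigns a class.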
Linear SVM: used when the data is linearly separable
Non-linear SVM: used when the data is not linearly separable and has to be projected to higher dimensions. The kernel trick is used to do this without computing the projection explicitly
Linear kernel: used when the data is linearly separable
Polynomial kernel: used when the data has a polynomial relationship
Radial basis function (RBF) kernel: used when the data is not linearly separable and needs to be projected to higher dimensions
Sigmoid kernel: based on the tanh function, similar to the activations used in neural networks
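The four kernels listed above are each a function of two input vectors. A minimal sketch of their usual textbook forms follows; the parameter names (degree, coef0, gamma) and their default values are assumptions chosen for the example, not values taken from the notes.

```python
import math


def _dot(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))


def linear_kernel(x, z):
    """Plain dot product: no projection to a higher dimension."""
    return _dot(x, z)


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """(x.z + coef0) ** degree."""
    return (_dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, z, gamma=0.1, coef0=0.0):
    """tanh(gamma * x.z + coef0)."""
    return math.tanh(gamma * _dot(x, z) + coef0)


x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```

Each function returns a similarity score computed in the original feature space, which is what allows the projection to higher dimensions to stay implicit.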
Hard margin SVM: used when the data is linearly separable without any overlap
Soft margin SVM: allows some misclassification for better generalisation
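One common way to write the soft margin idea is as an objective that trades margin width against violations: 0.5 * ||w||^2 + C * (sum of hinge losses). The sketch below only evaluates that objective for given w and b, it does not optimise it. The parameter name C and the sample data are assumptions for the example; labels are assumed to be +1 or -1.

```python
def soft_margin_objective(w, b, points, labels, C=1.0):
    """Evaluate 0.5 * ||w||^2 + C * sum(max(0, 1 - y * (w.x + b)))."""
    regulariser = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Zero loss for points outside the margin on the correct side,
        # growing loss for points inside the margin or misclassified.
        hinge_total += max(0.0, 1.0 - y * score)
    return regulariser + C * hinge_total


points = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, 1, -1, -1]
print(soft_margin_objective([1.0, 0.0], 0.0, points, labels, C=1.0))
```

A large C penalises violations heavily and behaves more like a hard margin; a small C tolerates more misclassification in exchange for a wider margin.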
Unsupervised ML: learns patterns from an unlabeled dataset and groups the data points based on those patterns
Clustering, Dimensionality Reduction
Cluster: a subset of objects that are similar. The distance between similar objects is smaller than the distance between dissimilar objects.
Clustering: the process of partitioning a set of data into meaningful subclasses, called clusters. Good clustering is when the intra-class similarity is high and the inter-class similarity is low.
Categories: Partitioning, Hierarchical, Density based, Model based and Grid based
Partitioning: a dataset D of n objects is divided into k non-overlapping clusters.
Advantages-
Simple and Understandable
Items are automatically assigned to a cluster
Disadvantages-
The number of clusters must be determined beforehand
All items are forced into a cluster
Very sensitive to outliers
Characteristics
Find clusters that are roughly spherical in shape
Distance based
Take the mean or median of each cluster as its centre point
Effective for small to medium sized datasets
Disadvantage
K-means clustering is sensitive to outliers, so k-medoids is used instead; it takes actual data points (medoids) as the centre points of the clusters
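The outlier sensitivity can be seen by comparing the two kinds of centre point on a single cluster. This sketch computes the mean (as k-means would) and the medoid (as k-medoids would) for one group of points; it is not a full clustering algorithm, and the function names and sample points are illustrative. It uses math.dist, available in Python 3.8 and later.

```python
import math


def cluster_mean(points):
    """Coordinate-wise average; need not coincide with any real data point."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def cluster_medoid(points):
    """The actual data point with the smallest total distance to all others."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Four nearby points plus one far-away outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (100.0, 100.0)]
print("mean:  ", cluster_mean(cluster))
print("medoid:", cluster_medoid(cluster))
```

The single outlier drags the mean far away from the dense group, while the medoid stays on one of the four nearby points.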
Semi-supervised ML: uses a small amount of labeled data together with unlabeled data. Learns from the labeled data, applies that to the unlabeled data, and finds patterns in a way similar to clustering.
CONFUSION MATRIX
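A confusion matrix for a binary classifier counts how predictions line up against the true labels. The sketch below is a minimal illustration; the function name binary_confusion_matrix, the dictionary keys and the sample labels are assumptions made for the example.

```python
def binary_confusion_matrix(y_true, y_pred, positive):
    """Count true/false positives and negatives for one chosen positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have equal length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            counts["tp"] += 1  # predicted positive, actually positive
        elif p == positive:
            counts["fp"] += 1  # predicted positive, actually negative
        elif t == positive:
            counts["fn"] += 1  # predicted negative, actually positive
        else:
            counts["tn"] += 1  # predicted negative, actually negative
    return counts


y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
print(binary_confusion_matrix(y_true, y_pred, positive="spam"))
```

The classification error rate is (fp + fn) divided by the total number of samples.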