Machine Learning
Hypothesis Testing
Null Hypothesis can have an "=", "<=" or ">=" comparison
Alternate Hypothesis can have a "!=", ">" or "<" comparison
Non-directional - Two-Tail Test - the Null Hypothesis is rejected if the observed value is > the high critical value or < the low critical value
e.g. Average Score = 70
Upper/Lower Critical Value = Mean +/- (Z-Score * Standard Error)
Z would be taken at the 97.5th percentile for a 95% two-tailed test, for example
If the value falls between the critical boundaries, we fail to reject the Null Hypothesis; otherwise, we reject it in favour of the Alternate Hypothesis
P-Value - probability read from the Z-table: the share of the normal distribution at least as extreme as the observed Z-score (doubled for a two-tailed test)
Z-Score = (Observed Value - Mean) / Standard Error
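A minimal sketch of this two-tailed test in Python, assuming scipy and a made-up sample mean, sample size and known population standard deviation:
```python
from scipy import stats
import numpy as np

# Hypothetical example: H0 says the population mean score is 70.
mu0, pop_std, n = 70, 10, 50          # assumed population std dev and sample size
sample_mean = 72.8                    # observed sample mean (made-up number)

std_error = pop_std / np.sqrt(n)
z_crit = stats.norm.ppf(0.975)        # 97.5th percentile for a 95% two-tailed test

lower = mu0 - z_crit * std_error      # low critical value
upper = mu0 + z_crit * std_error      # high critical value

z_score = (sample_mean - mu0) / std_error
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))   # two-tailed p-value

print(f"critical region: outside [{lower:.2f}, {upper:.2f}]")
print(f"z = {z_score:.2f}, p = {p_value:.3f}")
# Reject H0 only if sample_mean falls outside the critical interval (p < 0.05).
```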
t-Distribution
If the sample size is smaller than 30, we cannot use the Z-distribution table, as the sampling distribution's bell curve is not as tall as a Normal curve (it has fatter tails).
In cases where the degrees of freedom (sample size - 1) are small (< 30), we use the t-table instead of the Z-table to find the probability.
Because we are using the sample Std Dev instead of the Population Std Dev, there is extra variance, so the curve differs from the Normal Distribution curve (and approaches it as the sample grows).
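A quick check of this with scipy (assumed here), comparing the t and z critical values for a small sample:
```python
from scipy import stats

n = 15                                   # small sample, so df = n - 1 = 14
t_crit = stats.t.ppf(0.975, df=n - 1)    # t critical value for a 95% two-tailed test
z_crit = stats.norm.ppf(0.975)           # corresponding z critical value

print(t_crit, z_crit)   # ~2.145 vs ~1.960: the t cut-off is wider for small samples
```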
Classification
K-NN is a supervised method usable for both Classification and Regression; the closely related K-Means algorithm handles Clustering (unsupervised)
Challenges
Population of one class vastly out-numbers the other.
Can be overcome by under/over-sampling - e.g. with the imbalanced-learn (imblearn) package or the class_weight options in scikit-learn
For multi-class, we assign the item to the class with the highest probability
If we are not able to find good features for an item, we can classify it using similarity to known items, e.g. pharmacy products
If data behaviour is time-sensitive, we might need to split train/test by date range instead of randomly. This helps us see whether a model built on earlier data still holds good now.
Use forward selection, backward elimination, VIF for multicollinearity, etc. to optimise the number of features
For categorical data, we can use one-hot encoding (sklearn's OneHotEncoder or pandas get_dummies).
This might not work where there are a lot of unique values, like country names; we might need to fill in meaningful numbers instead (e.g. target or frequency encoding).
Missing value imputation - mean, median or mode. Sometimes imputing within a subset/class of the data is better.
We can also predict the missing feature as a target (y) from the other features using the training/test set
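A minimal encoding-plus-imputation sketch, assuming pandas and scikit-learn and hypothetical column names and values:
```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: a low-cardinality categorical column and a numeric column.
df = pd.DataFrame({
    "segment": ["retail", "pharmacy", "retail", None],
    "spend":   [120.0, None, 80.0, 95.0],
})

# One-hot encode the categorical column (pandas get_dummies here;
# sklearn's OneHotEncoder does the same job inside a pipeline).
dummies = pd.get_dummies(df["segment"], prefix="segment")

# Impute the missing numeric value with the median.
imputer = SimpleImputer(strategy="median")
df["spend"] = imputer.fit_transform(df[["spend"]]).ravel()

X = pd.concat([df[["spend"]], dummies], axis=1)
print(X)
```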
Curse of Dimensionality - as the number of features grows, we need an exponentially growing quantity of data to make well-generalised predictions - hence the need for dimensionality reduction
Trade off bias and variance appropriately: high bias leads to under-fitting and high variance leads to over-fitting.
Plain Logistic Regression is a binary classifier (multi-class needs one-vs-rest or multinomial extensions), whereas KNN can act as either a binary or multi-class classifier
Naive Bayes
If the input text contains a new word that is not part of the dictionary, we get a probability of 0. To avoid this, we add a constant to all word counts - Laplace Smoothing
In scikit-learn, Laplace smoothing is applied by default (alpha=1.0 in MultinomialNB)
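A small sketch of this with scikit-learn's MultinomialNB on made-up spam/ham texts:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["free prize now", "meeting at noon", "win a free prize", "project meeting notes"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham (toy data)

# alpha=1.0 is the default and corresponds to Laplace smoothing,
# so unseen words do not force a zero probability.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["free meeting prize"]))
```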
Neural Networks
Activation Function
ReLU / Leaky ReLU - most used
ReLU returns 0 if x < 0 and x itself if x >= 0
Leaky ReLU returns a small negative value (e.g. 0.01x) if x < 0
Support Vector Machine
Naive Bayes and Logistic Regression produce essentially linear decision boundaries; SVM is also good at non-linear problems
The basic SVM builds a hyper-plane with the maximum margin of separation between the classes. The limitation is that this works only for linearly separable data.
SVM supports non-linear classification using kernels, which map the features to a higher dimension where a separating hyper-plane can be built.
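A small illustration of the kernel idea, assuming scikit-learn and its make_circles toy dataset (not from the original notes):
```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles are not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

print("linear kernel:", linear_svm.score(X_test, y_test))
print("RBF kernel:   ", rbf_svm.score(X_test, y_test))
# The RBF kernel implicitly maps the points to a higher-dimensional space
# where a separating hyper-plane exists.
```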
Decision Tree
Very good at handling categorical data; numeric data is handled by splitting it into ranges/thresholds
Decision Tree Regression - each leaf predicts a value (commonly the mean of its training samples; model trees fit a linear regression per leaf)
In Classification, the leaf leads to a label
A decision tree should try to split the data into homogeneous sets - Gini impurity and Entropy both measure how mixed a node is, so lower values after the split are better and the split with the largest reduction (information gain) is chosen
Adv :
- Handle any kind of data
- No Normalization needed
- Intuitive
Dis-adv
- Prone to over-fitting
- Very unstable (small changes in the data can produce a very different tree)
CART trees - better suited for binary splits (classification/regression)
CHAID trees - good for multi-way splits and multi-class classifications
Measuring
Accuracy %. But it does not work well when
- we have an imbalanced data set
- we need probability-based accuracy, which it does not measure
Confusion Matrix. For multi-class classification, the diagonal counts (where Y = Y-Pred) should be high
"False Positive" - the "Positive" refers to the prediction; the "False" means the prediction did not match the actual label
The same convention gives True Negative, True Positive and False Negative
Precision = True Positive/(TP+FP)
Recall = TP/(TP+FN)
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
TPR = True Positive / Positive = TP/(TP+FN)
FPR = False Positive/Negatives = FP/(FP+TN)
Log Loss function --> to minimise log loss, the model should assign high probability where the actual Y = 1 and low probability where the actual Y = 0. Total log loss should then tend towards zero - used in Classification
MAD = Median Absolute Deviation of the errors
MAD = median(|e_i - median(e)|)
This is robust to the noise from large outlier errors
For Classification, use the Area Under the ROC Curve - a chart of TPR vs FPR for various positive-class probability thresholds.
The model with the larger area under the curve is the better model.
If both curves have the same area, the curve that rises faster is better, as it eliminates many negatives at lower probability thresholds
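A sketch computing these metrics with scikit-learn on made-up labels and predicted probabilities:
```python
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, log_loss, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual labels (toy data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                    # hard predictions
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]    # predicted P(class = 1)

print(confusion_matrix(y_true, y_pred))              # rows: actual, cols: predicted
print("precision:", precision_score(y_true, y_pred)) # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))    # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))
print("log loss: ", log_loss(y_true, y_prob))        # needs probabilities, not labels
print("ROC AUC:  ", roc_auc_score(y_true, y_prob))   # area under the TPR-vs-FPR curve
```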
Anomaly Detection
Methods
Density Based : DBScan, LOF
Distance Based: K-NN, K-Means, Regression Hyper-plane Distance
Parametric: GMM, One-Class SVMs, Extreme Value Theory
Approaches
Using K-Means
- Build clusters using K-Means
- Calculate the distance between each data point and its centroid, and pick an outlier distance threshold (needs manual verification)
- Any point whose distance is more than the largest allowed distance is marked as an outlier
DBSCAN - Density-Based Spatial Clustering of Applications with Noise
We need to choose the neighbourhood value (eps, the distance within which points count as clusterable) and the min-points parameter (minimum number of points needed to form a cluster)
Create superpixels for image segmentation before clustering; that way we reduce the number of pixels to cluster
Apply a StandardScaler on the obtained features.
Principal Component Analysis with n_components = 512.
Pass the reduced features to a One-Class SVM model or Isolation Forest
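A minimal sketch of this scale -> reduce -> detect chain with scikit-learn, assuming synthetic data and a small n_components instead of 512 so the toy example runs:
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 20)),     # normal points
               rng.normal(6, 1, size=(5, 20))])      # a few injected outliers

# StandardScaler -> PCA -> IsolationForest (a One-Class SVM would slot in the same way).
detector = make_pipeline(StandardScaler(),
                         PCA(n_components=10),
                         IsolationForest(contamination=0.03, random_state=0))
labels = detector.fit_predict(X)      # +1 = inlier, -1 = outlier

print("flagged outliers:", np.where(labels == -1)[0])
```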
Regression
Simple Linear Regression
R^2 = 1-RSS/TSS
R^2 0 is bad and ~1 is good
We use R^2 so that the error measure does not change with the unit of measure (UOM),
e.g. when the measure changes from kg to grams.
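The R^2 = 1 - RSS/TSS formula computed by hand and via scikit-learn on toy numbers:
```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.6, 9.4])

rss = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
print(1 - rss / tss, r2_score(y_true, y_pred))  # both give the same R^2
```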
Regularization
To avoid over-fitting in linear regression, we add a regularization (penalty) term to the cost function
Lasso Regression
More computationally costly than ordinary least squares (no closed-form solution),
but the advantage is that it checks whether each feature contributes any predictive power to the output; if not, it drives that feature's weight to 0 so it becomes inconsequential
sklearn.linear_model.Lasso
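A quick sketch showing Lasso zeroing out irrelevant coefficients, using synthetic data and scikit-learn's Lasso class:
```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually drive y; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # coefficients of the irrelevant features shrink to (near) zero
```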
Ensemble
Build multiple models on the same data set and take the mean of their outputs (regression) or the majority vote (classification) to get the answer
Random Forest
Rule of thumb for the number of features considered per split: sqrt(number of features) for Classification and number of features / 3 for Regression - see the sketch below
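In scikit-learn this rule of thumb maps to the max_features parameter; a sketch on toy datasets:
```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

Xc, yc = make_classification(n_samples=500, n_features=16, random_state=0)
Xr, yr = make_regression(n_samples=500, n_features=15, noise=5.0, random_state=0)

# Classification rule of thumb: consider sqrt(n_features) candidates per split.
clf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0).fit(Xc, yc)

# Regression rule of thumb: about n_features / 3 candidates per split.
reg = RandomForestRegressor(n_estimators=200, max_features=1/3, random_state=0).fit(Xr, yr)

print(clf.score(Xc, yc), reg.score(Xr, yr))
```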
Boosting
AdaBoost (classification) - builds models sequentially on weighted samples; after each model is built, it is evaluated and the weights of the misclassified data points are increased so they are more likely to be picked up by the next model
XGBoost
In XGBoost (gradient boosting), we create a simple regression model and calculate the difference (residual) between the actual and predicted output. In the second round, we fit another regression model that predicts that residual from the same input features, and keep adding stages.
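A hand-rolled two-stage sketch of this residual-fitting idea using plain decision trees on synthetic data (the real XGBoost library adds regularization and many refinements):
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# Round 1: fit a simple model to y.
model_1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residual = y - model_1.predict(X)          # difference between actual and predicted

# Round 2: fit a second model to the residual, using the same input features.
model_2 = DecisionTreeRegressor(max_depth=2).fit(X, residual)

# The final prediction is the sum of the stages.
y_hat = model_1.predict(X) + model_2.predict(X)
print("MSE stage 1:", np.mean((y - model_1.predict(X)) ** 2))
print("MSE stage 2:", np.mean((y - y_hat) ** 2))
```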
Time Series Analysis
Stationary data is obtained by:
Differencing - e.g. remove seasonality with a seasonal (t - 12) difference, then take a t - 1 difference to remove the trend and make the series stationary
Once the data is stationary, we can use an ARIMA model to predict
For ARIMA we have 3 parameters
p - AR lag (order)
d - differencing order
q - MA lag (order)
p and q are found using the PACF and ACF plots respectively
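A minimal statsmodels ARIMA sketch on synthetic data (in practice p and q would be read off plot_pacf/plot_acf of the differenced series):
```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series with a linear trend plus noise (stand-in for real data).
rng = np.random.default_rng(0)
y = pd.Series(np.arange(120) * 0.5 + rng.normal(scale=2.0, size=120))

# order = (p, d, q): AR lag, differencing order, MA lag.  d=1 removes the trend.
model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.params)
print(model.forecast(steps=6))   # forecast the next 6 periods
```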
Neural Networks
ANN
Neuron (node) and synapse (the weighted connection carrying one neuron's output to the next)
Activation function - types
Threshold function (Yes/No)
Sigmoid Function (1/(1+e^-x))
Rectifier Function (most popular) = Max(x,0)
Hyperbolic Tangent - similar to sigmoid but goes between -1 to +1 = (1-e^-2x)/(1+e^-2x)
Batch Gradient Descent - update weights based on the whole data set to minimise the error
Stochastic Gradient Descent - update the weights one sample at a time
Mini-batch - doing gradient descent on a few rows at a time
The weight updates are computed by back-propagating the error
One full forward-and-backward pass over the training data is called an epoch
CNN
Consists of the following layers
- Convolution layer - applies many filters (products) on the original image
  - Apply a non-linearity like ReLU or Leaky ReLU to the filtered output
- Apply a pooling algorithm like MaxPooling - taking the max value of, say, each 2x2 patch. Others include AvgPooling. The opposite of pooling (encode) is UpSampling (decode).
- Flattening - convert the 2-D feature maps into a single 1-D vector
- Pass the flattened features to a fully-connected neural network
- Go through backpropagation and epochs including filter changes to come up with a good classification algorithm.
SoftMax function - when we have 2 or more classes, the probabilities across all classes should sum to 1. So for classes A and B, if Prob(A) = 0.75 then Prob(B) should = 0.25.
To achieve this, softmax exponentiates each score and divides by the sum of the exponentials: e^A / (e^A + e^B), and similarly for B.
Cross-Entropy is similar to Log Loss: -Sum(y * log(y-hat))
This is used to measure the error in a CNN classifier.
Better than MSE because when y-hat is very far off, MSE produces a small gradient for a change in y-hat, whereas Cross-Entropy generates a big gradient.
Only used for classification problems, not regression problems.
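A small numpy sketch of softmax and cross-entropy as described above:
```python
import numpy as np

def softmax(scores):
    # Exponentiate, then divide by the sum so the class probabilities add up to 1.
    exps = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return exps / exps.sum()

def cross_entropy(y_true_onehot, y_prob):
    # -sum(y * log(y_hat)); only the true class term is non-zero.
    return -np.sum(y_true_onehot * np.log(y_prob))

scores = np.array([2.0, 0.5, -1.0])          # raw network outputs for 3 classes
probs = softmax(scores)
print(probs, probs.sum())                    # probabilities summing to 1

y_true = np.array([1, 0, 0])                 # true class is class 0
print(cross_entropy(y_true, probs))          # small when the true class gets high probability
```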
Inferential Statistics
Probability
Random Variables: numeric values assigned to logical/text outcomes, e.g. "3 blue balls + 1 red ball" -> 1, "4 red balls" -> 4, Yes/No -> 1/0, etc.
Permutation Probability
When calculating the probability of drawing 3 red balls, we should find the probability of each permutation (ordering) and add them up.
Binomial Distribution
If there are only 2 possibilities, say red and blue balls, then the Probability of Red = 1 - Probability of Blue
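This is the binomial setting; a quick scipy check with illustrative numbers (5 draws, P(red) = 0.4):
```python
from scipy import stats

n, p = 5, 0.4                         # 5 draws, P(red) = 0.4, so P(blue) = 0.6
print(stats.binom.pmf(3, n, p))       # P(exactly 3 red) = C(5,3) * 0.4^3 * 0.6^2
print(stats.binom.cdf(3, n, p))       # P(at most 3 red)
```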
Steps in Machine Learning
- Study the problem statement and identify the type of problem - Regression, Classification etc
- Read and join the data appropriately
- Drop duplicate columns
- Map values to numeric features
- Use Label+One Hot Encoder or get_dummies to encode values - Drop one column
- Check for outliers
- Check for correlation and drop highly correlated features. Also, chart features against each other, using raw values or a log scale for high-variance features.
- Impute missing values
- Normalize data
  - Balance the data if it is used for Classification
- Train-test split
- Use statsmodels.api to get the p-values - a feature with p > 0.10 is a candidate for elimination
OR
- Use RFE (Recursive Feature Elimination) to eliminate unwanted features. We can use scikit-learn's hyper-parameter tuning (e.g. GridSearchCV) to help select the optimal number of features
- Calculate VIF to identify multicollinearity to eliminate colinear features. Drop features with high VIF
- Fit and Predict values
- Evaluate outcomes
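A compressed sketch tying several of these steps together on a built-in scikit-learn dataset (real projects would add the EDA, imputation and VIF steps described above):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Train-test split, then scale (fit the scaler on the training data only).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Recursive Feature Elimination to keep the 10 most useful features.
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X_train, y_train)
X_train, X_test = selector.transform(X_train), selector.transform(X_test)

# Fit, predict, evaluate.
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```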
Steps in Machine Learning Pipeline
- Define Problem Statement
- Data Ingestion
- Data Preparation
- Data Segregation (Train, Test, Validation)
- Model Training
- Candidate Model Evaluation
- Model Deployment
- Performance Monitoring