Practical Statistics for Data Scientists (with Python) - Coggle Diagram
Practical Statistics for Data Scientists (with Python)
Exploratory Data Analysis
(cornerstone of any data science project)
Data Types
Numeric
Continuous
Discrete
Categorical
Binary
Ordinal
Rectangular data
Data frame
Estimates of location
Mean
Trimmed mean
Eliminates the influence of extreme values
Avoids the influence of outliers
Weighted Mean
Median
Weighted Median
Robust to outliers
Outliers are worth investigating
Estimates of Variability
Deviations
Variance
n-1
degree of freedom
Data scientists rarely need to worry about degrees of freedom
unbiased estimate
Standard deviation
Sensitive to outliers
Mean absolute deviation
Median absolute deviation from the median
Robust to outliers
Range
Order statistics(Ranks)
Percentile(Quantile)
50% is median
Interquartile range (IQR)
quantile(0.75) - quantile(0.25)
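A minimal sketch of the location and variability estimates above, using NumPy. The sample values and the helper names `trimmed_mean` and `median_abs_deviation` are made up for illustration; the trimmed mean here simply drops a fixed fraction of values from each end.

```python
import numpy as np

def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(len(x) * trim)            # number of values dropped from each end
    return x[k:len(x) - k].mean()

def median_abs_deviation(values):
    """Median absolute deviation from the median (unscaled)."""
    x = np.asarray(values, dtype=float)
    return np.median(np.abs(x - np.median(x)))

# Hypothetical sample with one extreme value (100)
data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 100]

mean = np.mean(data)                  # pulled upward by the outlier
tmean = trimmed_mean(data, trim=0.1)  # drops 2 and 100
median = np.median(data)
sd = np.std(data, ddof=1)             # ddof=1 gives the n-1 denominator
mad = median_abs_deviation(data)      # robust to the outlier
iqr = np.quantile(data, 0.75) - np.quantile(data, 0.25)

print(mean, tmean, median, sd, mad, iqr)
```

The mean and standard deviation move a lot because of the single extreme value, while the trimmed mean, median, MAD and IQR stay close to the bulk of the data.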
Explore the data distribution
Percentiles and Boxplot
max, P(75), median, P(25), min
Frequency tables and histograms
Density plots and estimates
Exploring Binary and Categorical data
Bar plot
Pie chart(not recommended)
Numeric data can be treated as categorical once it has been binned (as in a histogram)
Mode
Expected value
Business valuation and capital budgeting
probability
Correlation
Correlation coefficient
Pearson's correlation coefficient
Sensitive to outliers
Not useful if the relationship is nonlinear
Correlation matrix
Heatmap by seaborn python
Scatterplots
Good for small number of data
Exploring Two or More variables
Hexagonal Binning and Contours (Numeric data)
Good for bigger number of data
Two Categorical variables
Contingency table
Categorical and Numerical Data
Boxplots
Violin plot
Enhanced with density plot
Visualizing Multiple Variables
Conditioning variables
Tableau
Data and Sampling Distributions
Random Sampling and Sample Bias
Terms(Sample, Population, Random Sampling, Stratified sampling, Stratum/Strata, Simple random sample, bias, sample bias)
Bias
Indicate that model is misspecified or important variable left out
Random Selection/Sampling
Representative
Reduce bias
Sampling procedure
Timing
Size versus quality: when does size matter?
Sample mean versus population mean
Selection Bias
Regression to the mean
Vast search effect, non-random sampling, cherry-picking, stop an experiment when results look interesting.
Sampling distribution of a statistic
Central limit theorem
The distribution of sample means approaches a normal shape as the sample size grows, even when the source population is not normal
Standard error: sums up the Variability of a sample statistic
Use bootstrap resamples to estimate SE
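Not the same as standard deviation: SD measures the variability of individual data points, SE the variability of a sample statistic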
The bootstrap
With replacement
R is the number of iterations
Can be used with multivariate data
A resampling method to assess the variability of a sample statistic
Another resampling is permutation procedures (without replacement)
Can be used for sample size determination
A good use is in Random Forest
Bagging
Confidence intervals
The lower the level of confidence you can tolerate, the narrower the confidence interval will be
Bootstrap is an effective way to construct CI
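A small sketch of the bootstrap ideas above: resample with replacement R times, use the spread of the resampled means as the standard error, and take percentiles for a confidence interval. The data is synthetic and the 90% level is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sample = rng.exponential(scale=10.0, size=200)   # hypothetical skewed sample

R = 1000                                         # number of bootstrap iterations
boot_means = np.empty(R)
for i in range(R):
    # draw n values from the sample WITH replacement
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means[i] = resample.mean()

# Bootstrap estimate of the standard error of the mean
se_boot = boot_means.std(ddof=1)
# Classical formula for comparison: s / sqrt(n)
se_formula = sample.std(ddof=1) / np.sqrt(len(sample))

# 90% percentile confidence interval for the mean
ci_low, ci_high = np.percentile(boot_means, [5, 95])

print(se_boot, se_formula, (ci_low, ci_high))
```

Choosing a lower confidence level (say 80%) would use the 10th and 90th percentiles and give a narrower interval.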
Distributions
Normal Distribution
Z-score
Standard normal: μ = 0, σ = 1
QQ-plot
stats.probplot
Z-score vs. quantile of the normal distribution
How close it is to a normal distribution
Normalization/standardization
Long-Tailed Distributions
Can be seen by QQ-plot: normal in the middle
Black swans
Most data is not normally distributed
t-Distribution
Degrees of freedom
Not much used in DS
For large n (e.g. n > 200), the t-distribution is practically indistinguishable from the normal (z) distribution
Resembles the normal distribution, with thicker tails
Used when the population variance is unknown
Binomial Distribution
μ = np, var = np(1-p), sd = sqrt(np(1-p))
stats.binom.pmf(x,n,p)
stats.binom.cdf(x,n,p)
With large n and p close to 0.5, it can be approximated by normal distribution.
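The binomial formulas above can be checked with the standard library alone. This is an illustrative sketch, not a replacement for `stats.binom`; the helper names and the choice n = 20, p = 0.5 are assumptions for the example. The last lines compare the exact CDF with a normal approximation (with a continuity correction of 0.5).

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 20, 0.5
mean = n * p                      # np
var = n * p * (1 - p)             # np(1-p)
sd = sqrt(var)

# Mean recovered directly from the pmf, as a sanity check
pmf_mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))

# Normal approximation to P(X <= 12), with continuity correction
exact = binom_cdf(12, n, p)
approx = NormalDist(mu=mean, sigma=sd).cdf(12.5)

print(mean, var, sd, pmf_mean, exact, approx)
```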
Chi-Square Distribution
Measures the extent of the departure from what expected in a null model
Goodness-of-fit test
an 'A/B/C ...test'
Check whether two categorical variables are independent
A low chi-square value means the data closely follow the expected distribution
Deal with counts
Multiple treatments
F-Distribution
Deal with continuous values
Multiple treatments
ANOVA
Linear regression
Poisson and Related Distributions
Poisson Distributions
μ = λ, var = λ
Exponential Distributions
Estimating the failure rate
Constant rate
Weibull Distribution
Mechanical failure
Shape parameter, beta
Scale parameter
Statistical Experiments and Significance Testing
A/B testing
Multi-arm bandit for Data Science
Traditional statistics
Hypothesis Tests/significance tests
Why?
Underestimate the scope of natural random behavior
The Null Hypothesis
Alternative Hypothesis
One-Way vs. Two-Way (one-tailed vs. two-tailed) Hypothesis Tests
Resampling
Bootstrap
Permutation tests(To compare the observed value of statistic to the resampled distribution)
Exhaustive permutation test
For small sample size
Bootstrap permutation test
jackknife
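A minimal permutation test for the difference in means between two groups, as a sketch of the resampling approach above. The group values, the function name `permutation_test`, and the number of permutations are all made up for illustration; groups are shuffled together and re-split without replacement.

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and the fraction of random
    re-splits whose difference is at least as extreme.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # reassign group labels at random
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            count += 1
    return observed, count / n_perm

# Hypothetical measurements for treatments A and B
group_a = [10, 11, 12, 13, 14, 15]
group_b = [1, 2, 3, 4, 5, 6]
observed_diff, p_value = permutation_test(group_a, group_b)
print(observed_diff, p_value)
```

A small p-value here says the observed difference would rarely arise from random relabeling alone.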
Statistical Significance and p-Values
p-value
The probability that, given a chance model, results as extreme as the observed results could occur.
In data science, the p-value is just another piece of information bearing on a decision.
Alpha
p-value controversy
Practical significance
The threshold of "unusualness"
Type 1 and Type 2 errors
The basic function of a hypothesis test is to protect against being fooled by random chance, so tests are typically designed to minimize Type 1 errors.
t-Tests
Before the advent of computers, resampling tests were not practical, so the t-statistic was used instead
use sample variance
for two samples
Independent samples
equal variance
Uses a pooled variance estimate
unequal variances
Welch's T-test
Dependent samples
Paired
d=0
pre and post treatment effect
Used when the population variance is unknown
Multiple testing
Alpha inflation
False discovery rate
Ways to solve
Adjustment of p-values-Bonferroni adjustment
Holdout samples
Degrees of freedom
Part of the calculation for standardizing test statistics; matters little for significance testing in data science
Relevant when factoring categorical variables into n-1 dummy variables (to avoid multicollinearity)
ANOVA(Analysis of variance)
F-statistic
Decomposition of Variance
Two-way ANOVA
Interaction effects identification
Chi-Square Test :question:
A resampling approach
Pearson residual (observed v.s. expected)
Fisher's Exact Test
Chi-square statistic
Relevance for data science
Determine appropriate sample size
Used as a filter to determine whether an effect or a feature is worthy of further consideration.
To assess goodness of fit
To check whether two categorical variables are independent
Multi-Arm Bandit Algorithm
Test multiple treatments
Thompson sampling
Efficiently handle 3+ treatments
Power and Sample Size
Effect size
Power
Sample size
Significance level
Z-Test
When sample size is large or variance of population is known
For population proportions when np0>=10 and n(1-p0)>=10
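A sketch of a one-sample z-test for a population proportion, including the np0 >= 10 and n(1-p0) >= 10 check noted above. The function name and the example counts are hypothetical; the test is two-sided.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: p = p0 for a single proportion."""
    # Large-sample condition from the outline
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("normal approximation not appropriate for this n and p0")
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)        # standard error under the null
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical example: 60 conversions out of 100, null proportion 0.5
z, p_value = proportion_z_test(60, 100, 0.5)
print(z, p_value)
```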
Regression and Prediction
Simple linear regression
Fitted values and Residuals
Use least squares to fit
Prediction vs. Explanation (Profiling)
Multiple Linear Regression
RMSE(Root mean squared error)
MSE measures the variance of the residuals
MSE penalizes larger errors more than MAE, so it is more sensitive to outliers
RSE(Residual standard error)
Adjusted for degrees of freedom
R-squared
The proportion of variance explained by the model
Coefficient of determination
1 - RSS/TSS
t-statistic
compare the importance of variables of the model
Assessing the model
RMSE and RSE are nearly the same in big data applications
statsmodels give more detailed analysis
adjusted R-squared effectively penalizing the addition of more predictors to a model
R-squared and adjusted R^2 are nearly same with large data sets
Cross-Validation
Model Selection and stepwise regression
R^2 adj or/and AIC penalize complexity/number of variables
Reduce AIC
Backward selection/elimination
Forward selection
Penalized regression
ridge regression and lasso regression
Weighted regression
MAE measures the average absolute residual
RSS(Residual sum of squares)
TSS = RSS + ESS(explained variation)
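The regression metrics above can be computed by hand from a least-squares fit. This sketch uses synthetic data and NumPy's `lstsq` rather than statsmodels or scikit-learn; the coefficients and noise level are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 200
X = rng.normal(size=(n, 2))                       # two hypothetical predictors
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Least-squares fit with an intercept column
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
resid = y - fitted

p = X.shape[1]                                    # number of predictors
rss = np.sum(resid ** 2)                          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)                 # total sum of squares
ess = np.sum((fitted - y.mean()) ** 2)            # explained sum of squares

r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)     # penalizes extra predictors
rmse = np.sqrt(rss / n)
rse = np.sqrt(rss / (n - p - 1))                  # adjusted for degrees of freedom
mae = np.mean(np.abs(resid))

print(coef, r2, adj_r2, rmse, rse, mae)
```

With n = 200 and only two predictors, RMSE and RSE differ only slightly, and so do R-squared and adjusted R-squared.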
Prediction Using Regression
The Dangers of Extrapolation
Extrapolation beyond the range of the data can lead to error
Confidence and prediction intervals
Bootstrap samples can also be used to produce confidence and prediction intervals
Confidence intervals quantify uncertainty around regression coefficients
Prediction intervals quantify uncertainty in individual predictions
Which one to use Confidence intervals or prediction intervals?
Factor Variables(Categorical variables) In regression
Dummy variables Representation
One hot encoding
pd.get_dummies
With drop_first=True can avoid problem of multicollinearity
Factor Variables with Many levels
Consolidates based on median of the residual.
Ordered Factor variables
As a single numeric variable
Interpreting the regression equation
Correlated predictors
Multicollinearity
Not such a problem for nonlinear methods (like trees, clustering, nearest neighbors)
Cause numerical instability
Confounding variables
An important variable is left out of the model
Can produce unintuitive (e.g. negative) coefficients
Interactions and Main effects
Model selection with interaction Terms
Prior knowledge and intuition
stepwise selection
Penalized regression
Most common approach: tree models, random forest and gradient boosted trees
Regression Diagnostics
Outliers
Detect by standardized residual
influence.resid_studentized_internal
Influential Values
Cook's distance
Influence plot/bubble plot
influence.cooks_distance
Remove points (outliers) with a high Cook's distance
Useful only in a smaller data sets
Can be very useful in Purposes of anomaly detection
Heteroskedasticity, Non-Normality, and Correlated Errors
Lack of constant residual variance across the range of the predicted values.
Visualizing the data is the convenient way to analyze residuals
sns.regplot
May suggest an incomplete model
The assumption of independent errors is less important for data scientists focused on prediction; the distribution of the residuals is not critical in data science
Partial Residual Plots and Nonlinearity
can be used to qualitatively assess the fit for each regression term, possibly leading to alternative model specification
Visualize how well the estimated fit explains the relationship between a predictor and the outcome
sm.graphics.plot_ccpr
Multicollinearity
examining the variance inflation factor(VIF)
Removing correlated variables, linearly combining the variables or using PCA/PLS(partial least squares)
Confounding variables
selection bias
Using stratification to handle it
Polynomial and Spline Regression
Polynomial Regression
Spline
Spline regression is not necessarily better model
Knots
Generalized Additive Models(GAM)
LinearGAM
Assumptions
Linearity
Homoscedasticity
The variance of residuals is constant
Check it by plotting residuals versus the fitted values
If heteroscedasticity is present, transform the dependent variable or include nonlinear terms in the model
Independence
i.i.d
Normality
i.i.d
The distribution of Y is assumed to be normal
The residuals are normally distributed
QQ plot
If not, transforming the dependent variable(with a log or square-root transformation) can help reduce skew
Classification
Key terms
Probability score(propensity)
Sliding cutoff
More than two categories
Convert the multiclass problem to a series of binary problem by conditional probabilities.
Naive bayes
The Naive Solution
Assumes the predictor variables are independent of one another
sklearn.naive_bayes
MultinomialNB
Easy and require only a small amount of training data
Numeric predictor variables
Bin and convert to categorical predictor
Use a probability model-eg normal distribution
Works with categorical predictors and outcomes
Text classification
Discriminant Analysis
Linear discriminant analysis(LDA)
Less widely used with the advent of more sophisticated techniques(tree models and logistic regression)
Covariance Matrix
Fisher's Linear Discriminant
Assumes predictor variables are normally distributed if they are continuous numeric variables, technically.
In practice, it works well for nonextreme departures from normality
Maximize SSbetween and minimizing SSwithin
A simple example
Using Discriminant Analysis for Feature Selection
sklearn.discriminant_analysis
Extensions of DA
QDA
Logistic Regression
Logistic Response Function and Logit
Sigmoid function
Logistic Regression and the GLM
LogisticRegression from sklearn.linear_model
C and Penalty prevent overfitting, Set C to a very large value to fit without regularization.
Generalized Linear Models(GLM)
A probability distribution or family (e.g. binomial, Poisson, negative binomial, gamma)
A link function (transformation function), e.g. logit or log link
Predicted Values from Logistic Regression
logit_reg.predict_proba()
logit_reg.predict_log_proba()
Interpreting the Coefficients and Odds Ratios
Relatively easy to interpret compared with other classification models
Linear and Logistic Regression: similarities and differences
Difference
The way the model is fit (least squares is not applicable)
The nature and analysis of the residuals from the model
Fitting the model
Using maximum likelihood estimation(MLE)
Handling factor variables by one hot encoder
Assessing the model
sm.GLM with family=sm.families.Binomial()
Stepwise regression, fit interaction, spline terms, confounding, correlated variables
Analysis of residuals
Less valuable than in regression but still useful
Logistic regression is fast and popular
Usually the first model tried; a high-bias, low-variance model. Highly correlated features should be handled with regularization or feature removal.
Evaluating Classification Models
Accuracy
The proportion of predictions that are correct
First step in evaluating a model
Accuracy paradox for imbalanced classes
The Rare Class Problem
(TP+TN)/(TP+TN+FP+FN)
Confusion matrix :star:
Precision, Recall, and Specificity
Precision= TP/(TP+FP)
Recall/sensitivity = TP/(TP+FN)
Sensitivity in biostatistics and medical diagnostics
Ability to predict a positive outcome
Specificity = TN/(TN+FP)
ability to predict a negative outcome
Python functions
confusion_matrix
precision_recall_fscore_support
Trade-off between optimizing for precision or recall
F1 = 2 * precision * recall / (precision + recall)
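The formulas above, written out in plain Python from the four confusion-matrix counts. The labels and predictions are a made-up example, and `classification_metrics` is a hypothetical helper; in practice `confusion_matrix` and `precision_recall_fscore_support` from scikit-learn cover the same ground.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, specificity and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical labels: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

This sketch does not guard against division by zero, which would occur if the model never predicted a positive or the data contained no positives.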
ROC curve (Receiver operating characteristics)
Capture the trade off between recall and specificity
Cutoff
True positive rate (y-axis) vs. false positive rate (x-axis)
AUC(Area underneath the curve)
AUC=1 perfect classifier
AUC=0.5 completely ineffective classifier(Random Classifier)
roc_auc_score
Ability of a model to distinguish 1s from 0s (how well the classifier separates the classes)
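One way to see what AUC measures: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (ties counted as half). The sketch below computes it by brute force over all pairs; `auc_score` is a hypothetical helper, not the `roc_auc_score` implementation, and the scores are invented.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0        # positive ranked above negative
            elif p == n:
                wins += 0.5        # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical scores from a classifier
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(y_true, scores)
print(auc)
```

The pairwise loop is quadratic in the number of records, so it is only meant for small illustrations.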
Lift
intermediate step in settling on an appropriate cutoff level
measures how effective a model is at identifying the 1s, decile by decile
Strategies for Imbalanced Data
Undersampling
downsample the prevalent class, dealing with smaller and more balanced data
When have enough data
Oversampling and Up/Down weighting
Upsample the rarer class by drawing additional rows with replacement(bootstrapping)
Using sample_weight
Attach weight to the rare or prevalent class
Data Generation
Perturbing existing records to create new records
imbalanced-learn
SMOTE
Cost-Based classification
In practice, accuracy and AUC are a poor man's way to choose a classification rule
Exploring the predictions
Check Data Science for Business on imbalanced classes
SVM
RBF (Gaussian) kernel
Works well in high-dimensional spaces and with smaller amounts of data; interpretability is poor
Statistical machine learning
K-Nearest Neighbors (KNN)
Simple
All predictors must be in numeric form
KNeighborsClassifier
Output a probability(propensity) so it is important to set cutoff for imbalanced classes
Distance metrics
Euclidean distance
Manhattan distance
Mahalanobis distance
Accounts for the correlation between two variables by computing covariance matrix
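The three distance metrics above, sketched with NumPy. The points and the covariance matrix are invented for the example; in a real KNN setting the covariance matrix would be estimated from the training data.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return float(np.sum(np.abs(a - b)))

def mahalanobis(a, b, cov):
    """Distance that accounts for the covariance between variables."""
    d = a - b
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Hypothetical covariance: two variables with correlation 0.8
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])

d_euc = euclidean(a, b)
d_man = manhattan(a, b)
d_mah = mahalanobis(a, b, cov)
print(d_euc, d_man, d_mah)
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance; with positively correlated variables, a difference along the direction of correlation counts for less.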
One hot encoder
Multicollinearity is not an issue for KNN
Standardization(normalization, z-scores)
preprocessing.StandardScaler(), scaler.transform
Center with the mean or median; scale by the standard deviation or interquartile range
Subjective knowledge matters, if some variable is more important, we can scale it up
Does not change the distributional shape
Choosing K
If K is too low, overfitting
If K is too high, the data is oversmoothed and local structure is missed
For highly structured data (high signal-to-noise ratio, SNR) choose a smaller K; for lower SNR, choose a larger K
Choose Odd number to avoid ties
KNN as a feature engine
Often used as a first stage in predictive modeling: the predicted value is added back into the data as a predictor for a second-stage (non-KNN) model
Tree models (CART)
DecisionTreeClassifier
plotDecisionTree
print(textDecisionTree(...))
The Recursive Partitioning Algorithm
Measuring Homogeneity or Impurity
Gini impurity
Gini index
Entropy of information
calculate the Information Gain
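A short sketch of the impurity measures above using only the standard library. The labels and split are a toy example, and the function names are chosen for illustration; entropy is measured in bits (log base 2).

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in entropy from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy node with a 50/50 class mix, split perfectly into two pure partitions
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 1], [0, 0, 0, 0]

print(gini_impurity(parent), entropy(parent), information_gain(parent, left, right))
```

Recursive partitioning picks, at each step, the split that reduces impurity the most.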
Stopping the tree from growing
max_depth, min_samples_split
GridSearchCV to combine exhaustive search with cross-validation
To avoid overfitting
Predicting a continuous value
Use RMSE to evaluate performance
DecisionTreeRegressor
How Trees are used
Tree models provide visual tool for exploring data and easy to communicate rules
Multiple-tree (ensemble) models perform better but lose interpretability
Bagging and the Random Forest
Bagging
Bootstrap aggregating
Bootstrap resample
significantly Reduce variance
Random Forest
RandomForestClassifier
n_estimators
Blackbox
Noisy and overfitting
Variable importance
rf.feature_importances_
Hyperparameters
nodesize/min_samples_leaf
maxnodes/max_leaf_nodes
maxnodes = 2 * max_leaf_nodes - 1
Default settings may lead to overfitting on noisy data
Increasing nodesize or setting maxnodes fits smaller trees
Cross-validation can test effects of setting different values for hyperparameters
quick training time(in parallel), prediction performance
random subset of features, preventing the important features from always being present at the tops of individual trees
Boosting
Adaboost(Adaptive boosting)
Ensure models with lower error have a bigger weight
Popular; built by combining a variety of weak learners
Stochastic gradient boosting
Like random forest by sampling observations and predictors
Without replacement
pseudo-residual
Most commonly uses tree models
A sequence of models (powerful but requires more care)
XGBoost
using Stochastic gradient boosting
eta(learning_rate) prevents overfitting by reducing the change in weights
Has many parameters
XGBClassifier
subsample
heavily used due to its execution speed and model performance
Regularization:Avoiding Overfitting
Penalizes model complexity by modifying the cost function
Alpha: L1 penalty
Lambda: L2 penalty
e.g. 1000; larger values make overfitting less likely
Initial ideas from Ridge(L2) Regression and Lasso(L1) Regression
L1-absolute value
Coefficients Shrink to 0
feature selection method
L2-squared magnitude
Elastic net: L1 and L2 can be linearly combined
Hyperparameters and Cross-validation
Using cross validation
itertools.product to create all possible combinations of parameters
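default eta 0.1 in older XGBClassifier releases (0.3 in current XGBoost)
default max_depth 3 in older XGBClassifier releases (6 in current XGBoost)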
default subsample 1.0
default lambda 1 and alpha 0
Gradient Descent
Minimize Loss/Cost function
Learning rate(a)
Stochastic gradient descent(SGD) to avoid being stuck in local minimum and saddle point
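A bare-bones illustration of gradient descent on a one-dimensional quadratic loss. The loss function, starting point, and learning rate are arbitrary choices for the sketch; this is plain (batch) gradient descent, not the stochastic variant mentioned above.

```python
def gradient_descent(grad, start, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    x = start
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)   # move downhill by learning_rate * slope
    return x

# Hypothetical loss f(x) = (x - 3)^2 with gradient f'(x) = 2 * (x - 3)
def loss(x):
    return (x - 3) ** 2

def loss_grad(x):
    return 2 * (x - 3)

x_min = gradient_descent(loss_grad, start=0.0)
print(x_min, loss(x_min))
```

A learning rate that is too large makes the updates overshoot and diverge (for this loss, anything above 1.0), while a very small one converges slowly.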
Gradient boosting is a generalized form of Adaboost
Unsupervised Learning
Principal Component Analysis
PCA(n_components=2)
Computing the principal components
Only for numeric variables
maximize the percent of total variance explained
Interpreting principal components
Scree plot from pca.explained_variance_
How many components to choose?
Rule of thumb: keep enough components to explain about 80% of the variance
cross-validation
correspondence analysis
For categorical data
not useful for big data context
Linear combinations
Applications :star:
Data visualization
Help understand the underlying structure and relationships among the data points
Feature extraction
Noise reduction
Data compression
Limitation
Assume linear relationship between features
Not robust for outliers
Applications
To create a predictive rule in the absence of a labeled response
for the prediction
Cold-start problem
Extension of the EDA
Reduce dimension
Curse of dimensionality
PCA is most commonly used
K-means clustering
KMeans(n_clusters=k).fit(df)
sns.scatterplot
K-Means Algorithm
n_init by default is 10 times
running multiple times with different initializations
Recommend to run several times
Interpreting the clusters
size of the clusters
Counter(kmeans.labels_)
Clusters should be relatively balanced; imbalanced clusters can result from distant outliers or from groups of records very distinct from the rest of the data
centers of clusters
kmeans.cluster_centers_
Cluster Analysis Versus PCA
The sign of the cluster means is meaningful, while PCA identifies principal directions of variation.
Selecting the number of clusters
Dictated by the application
elbow method
may have drawback on some data
Practical consideration dominate choice of K
Important questions to ask
How likely are the clusters to be replicated on new data?
Are the clusters interpretable
Do they relate to a general characteristic of the data or just reflect a specific instance?
using cross-validation to evaluate
How?
Arbitrarily select centroids
iteratively update centroid until convergence
minimizing the loss function(objective function)
Euclidean distance
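The steps above (pick starting centroids, assign points, update centroids, repeat until convergence) can be sketched in a few lines of NumPy. This is a simplified illustration with random initial centroids and synthetic data, not scikit-learn's `KMeans`, which by default uses a smarter initialization and multiple restarts.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid (Euclidean distance) for each row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrarily select k distinct records as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                                  # converged
        centroids = new_centroids
    labels = assign(X, centroids)
    # Objective: within-cluster sum of squared distances
    inertia = float(np.sum((X - centroids[labels]) ** 2))
    return labels, centroids, inertia

# Two hypothetical, well-separated blobs of 50 points each
rng = np.random.default_rng(seed=42)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=5.0, scale=0.5, size=(50, 2))])

labels, centroids, inertia = kmeans(X, k=2)
print(centroids, inertia)
```

Because the result depends on the starting centroids, running the algorithm several times with different seeds and keeping the lowest-inertia solution is the usual safeguard.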
Hierarchical Clustering
Only can focus on small data set because cost is high
Highly intuitive graphical display; interpretable
linkage(df)
The Dendrogram
fcluster
Do not need to prespecify the number of clusters
cutree
The Agglomerative algorithm
iteratively merges similar clusters
complete-linkage method is one measure of dissimilarity
Measures of dissimilarity
complete linkage
single linkage
average linkage
minimum variance
Model-based clustering(more recently)
Multivariate Normal Distribution(Gaussian Mixture model-GMM)
Mixtures of normals
Selecting the number of clusters
BIC
Try to learn the true value of k
Limitations
Require an underlying assumption of a model for the data
Computational cost is high
More flexible(Because takes into account the mean and variance not only mean)
Scaling and Categorical Variables :star:
Unsupervised learning requires scaling :star:
Categorical data can be problematic for PCA and k-means
Scaling and Categorical Variables
preprocessing.StandardScaler() to produce a more balanced set of clusters
Dominant variables
Rescale or exclude them
Categorical data and Gower's Distance
convert to numeric data by ranking or encoding
Problems with clustering mixed data
K-means and PCA are most appropriate for continuous variables
hierarchical clustering with Gower's distance is for smaller data
For a large data set, could apply clustering on subsets on specific categorical values
Model Evaluation and Selection :star:
Bias-Variance Trade-off
Model complexity and overfitting
Regularization
Interpretability and explainability
Trade-off between performance and model interpretability
Model Training
Cross-validation
k-fold cross-validation
Leave-one-out cross-validation (LOOCV)
Train validation split
Special way for Time-series data
Bootstrapping and bagging
Imbalance class
ensemble learning
Bagging
Hyperparameter Tuning
Grid search
random search
Training times and learning curves
identify if the model was overfitting
Validation error is not improving
Help discover whether the dataset is representative
a large gap between the training curve and validation curve