Practical Statistics for Data Scientists (with Python) - Coggle Diagram

- - - - Continuous
      - Discrete
    - - Binary
      - Ordinal
  - - - Trimmed mean
        
        Eliminates the influence of extreme values
        
        Avoids the influence of outliers
      - Weighted Mean
    - - Weighted Median
      - Robust to outliers
      - Outliers are worth investigating
  - - - n-1
        
        degrees of freedom
        
        Data scientists rarely need to worry about degrees of freedom
        
        unbiased estimate
    - - Sensitive to outliers
    - - Robust to outliers
    - - The 50th percentile is the median
    - - quantile(0.75) - quantile(0.25)
  - - - max, P(75), median, P(25), min
  - - - Business valuation and capital budgeting
  - - - Pearson's correlation coefficient
      - Sensitive to outliers
      - Not useful if the relationship is nonlinear
    - - Heatmap with seaborn (Python)
    - - Good for a small number of data points
  - - - Good for a larger number of data points
    - - Contingency table
    - - Boxplots
      - Violin plot
        
        Enhanced with density plot
    - - Conditioning variables
      - Tableau
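The robust location and spread estimates listed in this branch (trimmed mean, median, interquartile range) can be sketched with the Python standard library alone. This is a minimal illustration, not the book's code: the helper names `trimmed_mean` and `iqr` and the sample data are assumptions made for the example, and the quartiles use the "inclusive" interpolation method, which is only one of several conventions.

```python
import statistics


def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values dropped from each end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return statistics.mean(kept)


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical data with one extreme value (100).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

plain_mean = statistics.mean(data)       # pulled upward by the outlier
robust_mean = trimmed_mean(data, 0.1)    # drops 1 value from each end
median = statistics.median(data)
spread = iqr(data)

print(plain_mean, robust_mean, median, spread)
```

The plain mean is dragged toward the outlier, while the trimmed mean and the median stay near the bulk of the data, which is the sense in which the notes call them robust.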
- - - - Indicate that model is misspecified or important variable left out
    - - Representative
        
        Reduce bias
      - Sampling procedure
        
        Timing
  - - - The distribution of sample means tends toward normal when the sample size is large enough
    - - Use bootstrap resamples to estimate SE
      - Is different from the SD
  - - - Another resampling approach is the permutation procedure (sampling without replacement)
    - - Bagging
  - - - Z-score
      - Standard normal: μ=0, sd=1
      - QQ-plot
        
        stats.probplot
        
        Z-score vs. quantile of the normal distribution
        
        How close it is to a normal distribution
      - Normalization/standardization
    - - Can be seen by QQ-plot: normal in the middle
      - Black swans
      - Most data is not normally distributed
    - - Degrees of freedom
      - Not much used in DS
        
        For n>200, the t-distribution nearly coincides with the z-distribution
      - Resembles the normal distribution but with thicker tails
      - Used when the population variance is unknown
    - - μ = np, var = np(1-p), sd = sqrt(np(1-p))
      - stats.binom.pmf(x,n,p)
      - stats.binom.cdf(x,n,p)
      - With large n and p close to 0.5, it can be approximated by normal distribution.
    - - Measures the extent of the departure from what expected in a null model
      - Goodness-of-fit test
        
        an 'A/B/C ...test'
        
        Check whether two categorical variables are independent
      - A low chi-square value means the counts follow the expected distribution
      - Deal with counts
      - Multiple treatments
    - - Deal with continuous values
      - Multiple treatments
      - ANOVA
      - Linear regression
    - - Poisson Distributions
        
        μ = λ, var = λ
      - Exponential Distributions
        
        Estimating the failure rate
        
        Constant rate
      - Weibull Distribution
        
        Mechanical failure
        
        Shape parameter, beta
        
        Scale parameter
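The idea of using bootstrap resamples to estimate the standard error, noted earlier in this branch, can be sketched as follows. This is a standard-library illustration under stated assumptions: the function name `bootstrap_se`, the simulated sample, and the number of resamples are all hypothetical choices, not taken from the source.

```python
import random
import statistics


def bootstrap_se(sample, stat=statistics.fmean, n_resamples=1000, seed=0):
    """Estimate the standard error of `stat` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    # Each resample has the same size as the original sample.
    estimates = [stat(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    # The SD of the resampled statistics is the bootstrap SE.
    return statistics.stdev(estimates)


# Hypothetical sample: 100 draws from a normal distribution.
rng = random.Random(42)
sample = [rng.gauss(50, 10) for _ in range(100)]

se_boot = bootstrap_se(sample)
# Textbook formula for the SE of the mean, for comparison.
se_formula = statistics.stdev(sample) / len(sample) ** 0.5

print(round(se_boot, 3), round(se_formula, 3))
```

For the mean, the two numbers should be close. The appeal of the bootstrap is that the same function works for statistics with no simple formula, for example by passing `stat=statistics.median`.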
- - - - Underestimate the scope of natural random behavior
  - - - Exhaustive permutation test
        
        For small sample size
      - Bootstrap permutation test
  - - - The probability that, given a chance model, results as extreme as the observed results could occur.
      - In data science, the p-value is just one more piece of information bearing on the decision.
    - - p-value controversy
      - Practical significance
      - The threshold of "unusualness"
    - - The basic function of a hypothesis test is to protect against being fooled by random chance, so it is structured to minimize Type 1 errors.
  - - - for two samples
        
        Independent samples
        
        equal variance
        
        Uses a pooled variance estimate
        
        unequal variances
        
        Welch's T-test
        
        Dependent samples
        
        Paired
        
        d=0
        
        pre and post treatment effect
  - - - Adjustment of p-values: Bonferroni adjustment
      - Holdout samples
  - - - Decomposition of Variance
    - - Interaction effects identification
  - - - Pearson residual (observed v.s. expected)
    - - Determine appropriate sample size
      - Used as a filter to determine whether an effect or a feature is worthy of further consideration.
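The permutation test and the p-value definition from this branch ("results as extreme as the observed results" under a chance model) can be sketched as below. This is a random-shuffle version rather than an exhaustive one, written with the standard library only. The function name `permutation_test` and the two small groups are hypothetical.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and the share of shuffles whose
    difference is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(group_b) - statistics.fmean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels without replacement
        diff = statistics.fmean(pooled[n_a:]) - statistics.fmean(pooled[:n_a])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Hypothetical pre/post style measurements for two independent groups.
group_a = [10, 12, 11, 13, 12, 11, 10, 12]
group_b = [15, 17, 16, 18, 16, 17, 15, 16]

observed_diff, p_value = permutation_test(group_a, group_b)
print(observed_diff, p_value)
```

A small p-value here only says that random label shuffling rarely produces a gap this large. Whether the gap matters in practice is the separate question of practical significance.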
- - - - MSE measures the variance of residuals
        
        MSE penalizes larger errors more than MAE, so it is more sensitive to outliers
    - - Adjusted for degrees of freedom
    - - The proportion of variance explained by the model
      - Coefficient of determination
      - 1 - RSS/TSS
    - - Compare the importance of variables in the model
    - - RMSE and RSE are nearly the same in big-data applications
      - statsmodels gives a more detailed analysis
      - Adjusted R-squared effectively penalizes the addition of more predictors to a model
      - R-squared and adjusted R-squared are nearly the same with large data sets
    - - Adjusted R^2 and/or AIC penalize complexity/number of variables
        
        Reduce AIC
      - Backward selection/elimination
      - Forward selection
      - Penalized regression
        
        ridge regression and lasso regression
      - Weighted regression
    - - TSS = RSS + ESS(explained variation)
  - - - Extrapolation beyond the range of the data can lead to error
    - - Bootstrap samples can also be used to produce confidence and prediction intervals
      - Confidence intervals quantify uncertainty around regression coefficients
      - Prediction intervals quantify uncertainty in individual predictions
      - Which one to use Confidence intervals or prediction intervals?
  - - - One hot encoding
        
        pd.get_dummies
        
        Setting drop_first=True avoids the multicollinearity problem
    - - Consolidates based on median of the residual.
    - - As a single numeric variable
  - - - Is not such a problem for nonlinear regression methods (like trees, clustering, nearest neighbors)
      - Cause numerical instability
    - - Important variable is not included
      - Some unintuitive negative coefficients
    - - Model selection with interaction Terms
        
        Prior knowledge and intuition
        
        stepwise selection
        
        Penalized regression
        
        Most common approach: tree models, random forest and gradient boosted trees
  - - - Detect by standardized residual
        
        influence.resid_studentized_internal
    - - Cook's distance
        
        Influence plot/bubble plot
        
        influence.cooks_distance
        
        Remove points (outliers) with a high Cook's distance
      - Useful only in smaller data sets
      - Can be very useful for anomaly detection
    - - Lack of constant residual variance across the range of the predicted values.
      - Visualizing the data is a convenient way to analyze residuals
        
        sns.regplot
      - May suggest an incomplete model
      - The assumption that errors are independent is not that important for data scientists, and the distribution of residuals is not critical in data science
    - - can be used to qualitatively assess the fit for each regression term, possibly leading to alternative model specification
      - Visualize how well the estimated fit explains the relationship between a predictor and the outcome
        
        sm.graphics.plot_ccpr
    - - examining the variance inflation factor(VIF)
      - Removing correlated variables, linearly combining the variables or using PCA/PLS(partial least squares)
    - - selection bias
        
        Using stratification to handle it
  - - - Spline regression is not necessarily better model
      - Knots
    - - LinearGAM
  - - - The variance of residuals is constant
      - Check it by plotting residuals versus the fitted values
      - If there is heteroscedasticity, transform the dependent variable or include nonlinear terms in the model
    - - i.i.d
    - - i.i.d
      - The distribution of Y is assumed to be normal
      - The residuals are normally distributed
        
        QQ plot
        
        If not, transforming the dependent variable(with a log or square-root transformation) can help reduce skew
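The fit metrics named in this branch (R-squared as 1 - RSS/TSS, adjusted R-squared, RMSE, RSE) can be written out for a one-predictor least-squares fit using only the standard library. The helper names `fit_line` and `regression_metrics` and the five data points are hypothetical. In practice statsmodels or scikit-learn would report these values.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regression_metrics(y, y_hat, n_predictors):
    """R^2, adjusted R^2, RMSE and RSE from observed and fitted values."""
    n = len(y)
    mean_y = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual SS
    tss = sum((yi - mean_y) ** 2 for yi in y)              # total SS
    df = n - n_predictors - 1                              # residual degrees of freedom
    r2 = 1 - rss / tss
    return {
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / df,
        "rmse": math.sqrt(rss / n),
        "rse": math.sqrt(rss / df),
    }


# Hypothetical data, roughly y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
metrics = regression_metrics(y, fitted, n_predictors=1)
print(slope, metrics)
```

With only five rows the degrees-of-freedom adjustment is visible: RSE exceeds RMSE and adjusted R-squared sits below R-squared. As n grows, both gaps shrink, which matches the note that the pairs are nearly the same for large data sets.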
- - - - Convert the multiclass problem to a series of binary problems by conditional probabilities.
  - - - Assumes independence of predictor variables
      - sklearn.naive_bayes
        
        MultinomialNB
      - Easy to use and requires only a small amount of training data
    - - Bin and convert to categorical predictor
      - Use a probability model, e.g. a normal distribution
  - - - Less widely used with the advent of more sophisticated techniques(tree models and logistic regression)
    - - Assumes predictor variables are normally distributed if they are continuous numeric variables, technically.
      - In practice, it works well for nonextreme departures from normality
      - Maximizes SS_between while minimizing SS_within
      - A simple example
        
        Using Discriminant Analysis for Feature Selection
        
        sklearn.discriminant_analysis
      - Extensions of DA
        
        QDA
  - - - Sigmoid function
    - - LogisticRegression from sklearn.linear_model
        
        The C and penalty arguments prevent overfitting; set C to a very large value to fit without regularization.
    - - A probability distribution or family
      - A link function(Transformation function): Logit, log link, negative binomial, gamma
    - - logit_reg.predict_proba()
      - logit_reg.predict_log_proba()
    - - Easier to interpret than many other classification models
    - - Difference
        
        The way the model is fit (least squares is not applicable)
        
        The nature and analysis of the residuals from the model
      - Fitting the model
        
        Using maximum likelihood estimation(MLE)
      - Handling factor variables by one hot encoder
    - - sm.GLM with sm.families.Binomial()
      - Stepwise regression, fit interaction, spline terms, confounding, correlated variables
      - Analysis of residuals
        
        Less valuable than in regression but still useful
      - Logistic regression is fast and popular
  - - - The proportion of predictions that are correct
      - First step in evaluating a model
      - Accuracy paradox for imbalanced classes
        
        The Rare Class Problem
      - (TP+TN)/(TP+TN+FP+FN)
    - - Precision= TP/(TP+FP)
      - Recall/sensitivity = TP/(TP+FN)
        
        Sensitivity in biostatistics and medical diagnostics
        
        Ability to predict a positive outcome
      - Specificity = TN/(TN+FP)
        
        ability to predict a negative outcome
      - Python functions
        
        confusion_matrix
        
        precision_recall_fscore_support
      - Trade-off between optimizing for precision or recall
        
        F1 = 2 * precision * recall / (precision + recall)
    - - Capture the trade off between recall and specificity
      - Cutoff
      - True positive rate (y) vs. false positive rate (x)
    - - AUC=1 perfect classifier
      - AUC=0.5 completely ineffective classifier(Random Classifier)
      - roc_auc_score
      - Ability of a model to distinguish 1s from 0s (how well the classifier separates classes)
    - - intermediate step in settling on an appropriate cutoff level
      - Measures how effective a model is at identifying the 1s, decile by decile
  - - - Downsample the prevalent class, working with smaller and more balanced data
      - When there is enough data
    - - Upsample the rarer class by drawing additional rows with replacement(bootstrapping)
      - Using sample_weight
        
        Attach weight to the rare or prevalent class
    - - Perturbing existing records to create new records
      - imbalanced-learn
      - SMOTE
    - - In practice, accuracy and AUC are a poor man's way to choose a classification rule
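The confusion-matrix formulas in this branch (accuracy, precision, recall, specificity, F1) can be checked by hand on a tiny example. This is a standard-library sketch: the function names and the ten labels are hypothetical, and scikit-learn's `confusion_matrix` and `precision_recall_fscore_support`, named in the notes, would be the usual tools.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Metrics from the confusion matrix.

    Assumes both classes occur and at least one positive is predicted,
    so that no denominator is zero.
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical imbalanced example: 3 positives out of 10 records.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)

# Accuracy paradox: predicting all 0s still scores 70% accuracy here.
all_negative_accuracy = sum(1 for t in y_true if t == 0) / len(y_true)
print(all_negative_accuracy)
```

The last two lines illustrate the rare class problem from the notes: a classifier that never predicts a 1 reaches 0.7 accuracy on this data while having zero recall.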
- - - - Euclidean distance
      - Manhattan distance
      - Mahalanobis distance
        
        Accounts for the correlation between two variables by computing covariance matrix
    - - Multicollinearity is not an issue for KNN
    - - preprocessing.StandardScaler(), scaler.transform
      - Center with the mean or median, and scale by the standard deviation or interquartile range
      - Subjective knowledge matters, if some variable is more important, we can scale it up
      - Does not change the distributional shape
    - - If K is too low, the model overfits
      - If K is too high, the model oversmooths the data and fails to capture the local structure
      - For highly structured data (high SNR), choose a smaller K; for lower SNR, choose a larger K
      - Choose an odd number to avoid ties
    - - Often used as a first stage in predictive modeling, the predicted value is added back into as a predictor for second-stage(non-KNN) modeling
  - - - Gini impurity
        
        Gini index
      - Entropy of information
        
        calculate the Information Gain
    - - max_depth, min_samples_split
      - GridSearchCV to combine exhaustive search with cross-validation
      - To avoid overfitting
    - - Use RMSE to evaluate performance
      - DecisionTreeRegressor
    - - Tree models provide visual tool for exploring data and easy to communicate rules
      - Multiple-tree models perform better but lose interpretability
  - - - Bootstrap aggregating
      - Bootstrap resample
      - significantly Reduce variance
    - - RandomForestClassifier
        
        n_estimators
      - Blackbox
      - Noisy and overfitting
    - - rf.feature_importances_
    - - nodesize/min_samples_leaf
      - maxnodes/max_leaf_nodes
        
        maxnodes = 2 * max_leaf_nodes - 1
      - Using the defaults may lead to overfitting on noisy data
      - Increasing nodesize and setting maxnodes will fit smaller trees
      - Cross-validation can test effects of setting different values for hyperparameters
  - - - Ensure models with lower error have a bigger weight
      - It is popular, based on tuning a variety of weak learners
    - - Like random forest by sampling observations and predictors
        
        Without replacement
      - pseudo-residual
      - Most commonly uses tree models
    - - using Stochastic gradient boosting
      - eta(learning_rate) prevents overfitting by reducing the change in weights
      - Has many parameters
      - XGBClassifier
      - subsample
      - heavily used due to its execution speed and model performance
    - - Penalizes model complexity by modifying the cost function
      - Alpha L1
      - Lambda L2
        
        e.g. 1000; larger values make overfitting less likely
      - Initial ideas from Ridge(L2) Regression and Lasso(L1) Regression
        
        L1-absolute value
        
        Coefficients Shrink to 0
        
        feature selection method
        
        L2-squared magnitude
        
        Elastic net: L1 and L2 can be linearly combined
    - - Using cross validation
      - itertools.product to create all possible combinations of parameters
      - default eta 0.1
      - default max_depth 3
      - default subsample 1.0
      - default lambda 1 and alpha 0
    - - Minimize Loss/Cost function
      - Learning rate(a)
      - Stochastic gradient descent (SGD) to avoid getting stuck in local minima and saddle points
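The two tree-splitting criteria listed in this branch, Gini impurity and entropy with information gain, can be computed directly from class labels. This is a small standard-library sketch. It uses the common definitions 1 - Σp² for Gini impurity and -Σp·log2(p) for entropy, which may differ in form from the source's presentation. The function names and label lists are hypothetical.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions (0 means a pure node)."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class proportions."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


# Hypothetical node with a 50/50 class mix, and a split that separates it perfectly.
parent = [0, 0, 1, 1]
left, right = [0, 0], [1, 1]

print(gini_impurity(parent), entropy(parent))
print(information_gain(parent, left, right))
```

A tree-growing algorithm evaluates candidate splits with a measure like these and keeps the one that reduces impurity the most. Limits such as `max_depth` and `min_samples_split` then stop the recursion before the tree fits noise.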
- - - - Only for numeric variables
      - maximize the percent of total variance explained
    - - Scree plot from pca.explained_variance_
    - - 80%
      - cross-validation
    - - For categorical data
      - not useful for big data context
    - - Data visualization
        
        Help understand the underlying structure and relationships among the data points
      - Feature extraction
      - Noise reduction
      - Data compression
    - - Assume linear relationship between features
      - Not robust for outliers
  - - - Cold-start problem
    - - Curse of dimensionality
      - PCA is most commonly used
  - - - n_init by default is 10 times
        
        running multiple times with different initializations
      - Recommend to run several times
    - - size of the clusters
        
        Counter(kmeans.labels_)
        
        Should be relatively balanced; imbalanced clusters can result from distant outliers or from records very distinct from the rest of the data
      - centers of clusters
        
        kmeans.cluster_centers_
      - Cluster Analysis Versus PCA
        
        The sign of the cluster means is meaningful, while PCA identifies principal directions of variation.
    - - May be dictated by the application
      - elbow method
        
        may have drawback on some data
      - Practical considerations dominate the choice of K
      - Important questions to ask
        
        How likely are the clusters to be replicated on new data?
        
        Are the clusters interpretable
        
        Do they relate to a general characteristic of the data or just reflect a specific instance?
        
        using cross-validation to evaluate
    - - Arbitrarily select centroids
      - iteratively update centroid until convergence
      - minimizing the loss function(objective function)
      - Euclidean distance
  - - - fcluster
      - Do not need to prespecify the number of clusters
      - cutree
    - - iteratively merges similar clusters
      - complete-linkage method is one measure of dissimilarity
    - - complete linkage
      - single linkage
      - average linkage
      - minimum variance
  - - - BIC
      - Try to learn the true value of k
    - - Require an underlying assumption of a model for the data
      - Computational cost is high
  - - - preprocessing.StandardScaler() to produce a more balanced set of clusters
    - - Rescale or exclude them
    - - convert to numeric data by ranking or encoding
    - - K-means and PCA are most appropriate for continuous variables
      - hierarchical clustering with Gower's distance is for smaller data
      - For a large data set, clustering can be applied to subsets defined by specific categorical values
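The K-means procedure outlined in this branch (arbitrarily select centroids, iteratively update them until convergence, Euclidean distance) can be sketched in plain Python. This is a bare illustration of the basic iteration and not scikit-learn's `KMeans`: there is no `n_init` restart logic, and the function name `kmeans`, the seed, and the six points are hypothetical.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Basic K-means on tuples of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # arbitrary initial centroids

    def nearest(p):
        # Index of the closest centroid by Euclidean distance.
        return min(range(k), key=lambda i: math.dist(p, centroids[i]))

    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        updated = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # converged: assignments no longer change
            break
        centroids = updated

    return centroids, [nearest(p) for p in points]


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]

centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Because the result depends on the initial centroids, real implementations rerun the algorithm from several starting points and keep the best solution, which is what the `n_init` note above refers to. Scaling the variables first matters for the same reason it does in KNN: the distance is dominated by whichever variable has the largest range.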
- - - - Bagging
  - - - Validation error is not improving
    - - a large gap between the training curve and validation curve