Data Analytics
Descriptive Analytics/Unsupervised Learning
Association Rule
Is the output good?
Lift (not directional)
Lift of (X --> Y)
Result
Lift = 1 = customers who buy X are as likely to buy Y as any other customer
Lift > 1 = customers who buy X are more likely to buy Y (complementary)
Lift < 1 = customers who buy X are less likely to buy Y (substitute)
S(X --> Y) / (S(X) * S(Y))
P(X, Y) / P(X)P(Y)
Could the rule be just by chance?
Pure Chance
90% of customers buy X
90% of customers buy Y
By pure chance .9 * .9 = 81% of transactions include itemset {X, Y}
Support (not directional)
Support Percentage
of a single itemset
of a rule
How frequent/common is the rule?
Support Count
of a single itemset
of a rule
Confidence (directional)
Confidence of Y --> X
How confident are we about the direction of the rule?
Confidence of X --> Y
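To make these metrics concrete, here is a minimal Python sketch (the transactions and item names are made up for illustration) showing one way support, confidence, and lift of a rule X --> Y could be computed:

```python
# Hedged sketch: support, confidence, and lift for a rule X --> Y
# from a small, made-up list of transactions.
transactions = [
    {"Coffee", "Bagel"},
    {"Coffee", "Bagel", "Milk"},
    {"Coffee"},
    {"Bagel"},
    {"Coffee", "Bagel"},
]

def support(itemset, transactions):
    """Support percentage: fraction of transactions containing the itemset."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(X, Y, transactions):
    """Confidence of X --> Y = S(X --> Y) / S(X); directional."""
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """Lift of X --> Y = S(X --> Y) / (S(X) * S(Y)); not directional."""
    return support(X | Y, transactions) / (support(X, transactions) * support(Y, transactions))

X, Y = {"Coffee"}, {"Bagel"}
print(support(X | Y, transactions))   # support of the rule
print(confidence(X, Y, transactions)) # confidence of Coffee --> Bagel
print(lift(X, Y, transactions))       # lift > 1 suggests complementary items
```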
Input
dataset/transaction
Itemset = subset of U
Instance of an itemset
Example: X = {Coffee, Bagel} = 2-item itemset
Any combo of existing items in U
U = universal set of items
Output
Association rules
If X then Y
Relationship between itemsets NOT items
X --> Y
One-directional
X & Y canNOT overlap
Process
Problem: with N items, there are 2^N potential itemsets
Apriori Algorithm
Steps
Check all 1-item itemsets & keep only frequent ones
Check all 2-item itemsets from prev. step, keep only frequent ones
Keep going until you checked frequent itemsets of all sizes
Generate all association rules
Find rules where confidence >= minconf
Only consider high support
Find all itemsets with support >= minsupp, a.k.a. frequent itemsets
If X is NOT frequent, then any larger itemset containing X is NOT frequent
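A rough Python sketch of the Apriori idea under these steps, using made-up transactions and a made-up minsupp value; it only finds the frequent itemsets, rule generation would follow:

```python
# Hedged sketch of the Apriori idea: grow candidate itemsets one size at a time,
# keeping only those whose support meets minsupp. Transactions are made up.
transactions = [
    {"Coffee", "Bagel"}, {"Coffee", "Bagel", "Milk"},
    {"Coffee"}, {"Bagel", "Milk"}, {"Coffee", "Bagel"},
]
minsupp = 0.4  # minimum support percentage

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Step 1: frequent 1-item itemsets
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= minsupp]
all_frequent = list(frequent)

# Steps 2..n: build size-k candidates from frequent (k-1)-itemsets, keep frequent ones
k = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= minsupp]
    all_frequent.extend(frequent)
    k += 1

print(all_frequent)  # every frequent itemset; rules are then generated from these
```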
Clustering
Algorithms
Hierarchical Clustering
Process
Prep
Distance between points
Distance matrix
Start with each data point as a 1-point cluster
Merge the 2 points with the shortest distance (most similar)
Merge 2 next-closest clusters
Repeat
Dendrogram
Reading It
Cut-off height determines the number of clusters
Higher cut-off = best 2-clusters solution
Lower cut-offs = best K-clusters solution for larger K
Repeat with different cut-offs to get solutions for any K
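A minimal sketch of hierarchical clustering and dendrogram cutting in Python, assuming scipy is available; the points and cut-offs are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hedged sketch: agglomerative (hierarchical) clustering on made-up 2-D points.
# linkage() builds the merge history behind the dendrogram; fcluster() "cuts" it.
points = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]])

Z = linkage(points, method="single")                 # merge closest clusters repeatedly
labels_k2 = fcluster(Z, t=2, criterion="maxclust")   # cut dendrogram into 2 clusters
labels_k3 = fcluster(Z, t=3, criterion="maxclust")   # same tree, different cut-off

print(labels_k2)  # best 2-cluster solution
print(labels_k3)  # best 3-cluster solution
```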
Application
Which one to choose?
Hierarchical
Cons
Computationally demanding
Pros
Data-driven
Solutions to any # of clusters
K-Means
Pros
Less computationally demanding
Cons
Poor initialization = bad results
Not good for irregular shapes, noisy data or clusters with different densities
Partitioning-based Clustering
Process
Decide the number of clusters (k)
Algorithm picks k data points at random as initial centroids
Assign every other data point to its closest centroid
Update each centroid based on the new clusters
Repeat the assignment step
Repeat the centroid-update step
Keep repeating until convergence
Application
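A minimal Python sketch of the k-means steps above, with made-up 2-D data and k = 2; a library implementation would normally be used instead:

```python
import random
import math

# Hedged sketch of the k-means process on made-up 2-D data.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
k = 2

def dist(a, b):
    return math.dist(a, b)  # Euclidean distance

random.seed(0)
centroids = random.sample(data, k)  # pick k data points at random as initial centroids

for _ in range(100):  # keep repeating until convergence (or a max number of passes)
    # assignment step: each point goes to its closest centroid
    clusters = [[] for _ in range(k)]
    for p in data:
        closest = min(range(k), key=lambda i: dist(p, centroids[i]))
        clusters[closest].append(p)
    # update step: recompute each centroid as the mean of its cluster
    new_centroids = [
        tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
        for i, c in enumerate(clusters)
    ]
    if new_centroids == centroids:  # convergence: assignments stop changing
        break
    centroids = new_centroids

print(centroids, clusters)
```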
Evaluating Clusters
Within Sum of Squared Errors (WSS)
Process
We have K clusters
Centroid for each cluster is m
For each point x in cluster C, the error = squared distance to its own cluster's centroid m
Sum for all data points in a cluster
Repeat for all clusters
Sum them all together
Interpreting
Used to compare different partitions for the same dataset
Lower WSS = Higher intra-similarity
Clusters are more cohesive within
Between Sum of Squared Errors (BSS)
Process
We have K clusters
Centroid for each cluster is m
Centroid for all points is m*
Squared distance between each cluster centroid (m) & the global centroid (m*)
Sum the squared distances, weighted by the # of points in each cluster
Interpreting
Higher BSS = Lower inter-similarity (clusters are better separated)
Used to compare different partitions for the same dataset
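A small Python sketch of how WSS and BSS could be computed for an already-partitioned, made-up dataset:

```python
import math

# Hedged sketch: WSS and BSS for a made-up, already-partitioned dataset.
clusters = {
    "A": [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)],
    "B": [(5.0, 5.0), (5.2, 4.8)],
}

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

all_points = [p for pts in clusters.values() for p in pts]
m_star = centroid(all_points)  # global centroid m*

# WSS: squared distance from each point to its own cluster centroid, summed over all clusters
wss = sum(
    math.dist(p, centroid(pts)) ** 2
    for pts in clusters.values()
    for p in pts
)

# BSS: squared distance from each cluster centroid to m*, weighted by cluster size
bss = sum(
    len(pts) * math.dist(centroid(pts), m_star) ** 2
    for pts in clusters.values()
)

print(f"WSS={wss:.3f} (lower = more cohesive), BSS={bss:.3f} (higher = better separated)")
```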
How to choose K?
No "right" answer BUT there are better answers
Eyeballing
Elbow Test
Basics
Input
dataset of n columns/features and m rows/records
Output
Clusters
What are good clusters?
high intra-similarity
low inter-similarity
Distance Metrics
Numerical Data
Max-coordinate
the max among the absolute differences
Euclidean
shortest distance between 2 points
Manhattan
sum of the absolute difference of their coordinates
Binary Data
Matching
number of mismatches divided by the total number of attributes (k)
range is always [0,1]
higher = more distant
Used for symmetric data
Jaccard
excludes matches where a = 0 and b = 0
Used for asymmetric data
Categorical Data
Matching
Taxonomy based approach
using industry-standard product hierarchy
Translation-based approach
Distance between 2 city names can be replaced by geographical distance
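A short Python sketch of the numerical and binary distance metrics above, on made-up records:

```python
import math

# Hedged sketch of the distance metrics above, on made-up records.
x = [1.0, 4.0, 2.0]
y = [3.0, 1.0, 2.0]

euclidean = math.dist(x, y)                          # straight-line distance
manhattan = sum(abs(a - b) for a, b in zip(x, y))    # sum of absolute differences
max_coord = max(abs(a - b) for a, b in zip(x, y))    # max absolute difference

# Binary records (1 = present, 0 = absent)
u = [1, 0, 1, 0, 0]
v = [1, 1, 0, 0, 0]

mismatches = sum(a != b for a, b in zip(u, v))
matching = mismatches / len(u)                       # symmetric binary data; range [0, 1]

# Jaccard: ignore positions where both values are 0
relevant = [(a, b) for a, b in zip(u, v) if not (a == 0 and b == 0)]
jaccard = sum(a != b for a, b in relevant) / len(relevant)   # asymmetric binary data

print(euclidean, manhattan, max_coord, matching, jaccard)
```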
Data Normalization
Min-Max
rescales all values between 0 and 1 using the min and max
(x - min) / (max - min)
Standardization/Z-score
transforms the data to have a mean of 0 and a standard deviation of 1
(x - sample mean) / (sample standard deviation)
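A minimal Python sketch of both normalizations on a made-up column of values:

```python
import statistics

# Hedged sketch: min-max rescaling and z-score standardization of a made-up column.
ages = [22, 25, 30, 41, 58]

lo, hi = min(ages), max(ages)
min_max = [(x - lo) / (hi - lo) for x in ages]   # rescaled to [0, 1]

mean = statistics.mean(ages)
sd = statistics.stdev(ages)                      # sample standard deviation
z_scores = [(x - mean) / sd for x in ages]       # mean 0, standard deviation 1

print(min_max)
print(z_scores)
```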
Predictive Analytics/Supervised Learning
Classification
Algorithms
k-NN/Nearest Neighbors
Choosing k
accuracy = 1 - error rate
lowest error rate in testing data
error rate = % of misclassified observations
Trade-offs
Overfitting
small k values
sensitive to noise
sensitive to outliers
Underfitting
large k values
miss local structure of data
Measure
Distance metrics (ex. Euclidean)
decide k = how many neighbors we consider?
Key Idea
identify the k "nearest"/most similar observations
nearest neighbors "vote" for their own class
majority = predicted class
Procedure
Pick K = how many neighbors we will consider
Pick distance measure + should we normalize?
Calculate the Euclidean distance between the new observation & all existing ones
Pick the k-nearest neighbors
Assign to majority class
Pros
Easy to implement/use
Does NOT require assumptions
Cons
"Lazy" classifier
No "real" model building
Bad for large datasets
Slow at prediction time - bad for real-time prediction
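A minimal Python sketch of the k-NN classification procedure, with made-up, already-normalized training data and k = 3:

```python
import math
from collections import Counter

# Hedged sketch of k-NN classification on made-up (already normalized) data.
train = [
    ([0.1, 0.2], "No"), ([0.2, 0.1], "No"), ([0.8, 0.9], "Yes"),
    ([0.9, 0.8], "Yes"), ([0.7, 0.9], "Yes"),
]
new_point = [0.75, 0.85]
k = 3

# Compute the distance from the new observation to every training observation
neighbors = sorted(train, key=lambda row: math.dist(row[0], new_point))[:k]

# The k nearest neighbors "vote"; the majority class is the prediction
votes = Counter(label for _, label in neighbors)
prediction = votes.most_common(1)[0][0]
print(prediction)  # "Yes" for this made-up example
```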
Decision Tree
Cons
Unstable: slight change in data = very different split
Splits are one attribute at a time = may miss interesting relationships between attributes
Pros
Robust to outliers
Good for variable selection
Do NOT require assumptions
Purity/Entropy
Binary data
Impure = when a leaf contains both true & false values of the outcome variable
lower entropy = purer
lower entropy = more observations in partition that have same value for outcome variable
weighted entropy = weighted avg of entropy for the leaves
Numeric data
Sort the rows by the numeric attribute (e.g., age)
Calculate the avg. value of each pair of adjacent rows = candidate thresholds
Calculate the weighted entropy of the split at each candidate threshold
Terminology
Leaf Node = final node
arrows to them
contains class label
Root Node = very top
arrows away from it
Internal Nodes/Branches/Decision Nodes = in betweens
arrows to them
arrows away from them
Building
Internal Nodes/Branches/Decision Nodes = in betweens
Construct all possible attribute thresholds
Pick the one with the largest information gain as the root node
Continue splitting based on information gain
The same attribute can be used again if splits do NOT overlap
Not all available attributes need to be used
Information Gain
= Entropy (parent node) - Weighted entropy (children nodes)
Largest = root
Can we reduce impurity by splitting the chosen root on other attributes? Repeat
Use majority rule to label the leaf nodes
Stopping Criteria
When all data points in a node are from the same class
There are no remaining attributes to split on
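A small Python sketch of entropy, weighted entropy, and information gain for a made-up candidate split on a binary outcome:

```python
import math
from collections import Counter

# Hedged sketch: entropy, weighted entropy, and information gain for a binary outcome.
def entropy(labels):
    """Entropy of a list of class labels; lower = purer."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy(parent node) minus the weighted avg. entropy of the child partitions."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Made-up split: a candidate threshold partitions 10 outcomes into two leaves
parent = ["Yes"] * 6 + ["No"] * 4
left, right = ["Yes"] * 5 + ["No"] * 1, ["Yes"] * 1 + ["No"] * 3

print(information_gain(parent, [left, right]))  # larger gain = better split
```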
Steps
Randomly split data into
testing data
training data
validation data
Model Construction/Training Step
Choose classification model
Train it on training data
End up with a trained model
Model Validation/Validation Step
Refine/fine-tune trained model on validation data
Sometimes skip this step
End up with a refined trained model
Model Testing/Testing Step
Assess how accurate model is
Use evaluation metrics
Receive testing performance
Apply model to new real data
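A minimal Python sketch of a random 60/20/20 split into training, validation, and testing data; the proportions and seed are arbitrary choices for illustration:

```python
import random

# Hedged sketch: a random 60/20/20 split into training, validation, and testing data.
records = list(range(100))  # stand-in for the rows of a real dataset

random.seed(42)
random.shuffle(records)

n = len(records)
train = records[: int(0.6 * n)]                    # used to train the model
validation = records[int(0.6 * n): int(0.8 * n)]   # used to refine/fine-tune the model
test = records[int(0.8 * n):]                      # used only to assess accuracy

print(len(train), len(validation), len(test))      # 60 20 20
```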
Evaluating
Unbalanced dataset
Accuracy is a bad measure
Some classes more important
Confusion Matrix
F1 - Score
Better than Accuracy for Unbalanced data
If we care about recall AND precision
= (2 x recall x precision) / (recall + precision)
Recall
Recall for negative class = Specificity = TN / (TN + FP)
For each ACTUAL class, how many were recovered?
Recall for positive class = Sensitivity = TP / (TP + FN)
Precision
For each PREDICTED class, how many did we get right?
Precision for positive class = Positive Predictive Value (PPV) = TP / (TP + FP)
Precision for negative class = Negative Predictive Value (NPV) = TN / (TN + FN)
Error Rate = (FN + FP) / total observations
Accuracy = (TP + TN) / total observations
Values
False Positive (FP): Predicted YES, Actual NO
True Positive (TP): Predicted YES, Actual YES
True Negative (TN): Predicted NO, Actual NO
False Negative (FN): Predicted NO, Actual YES
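A short Python sketch computing these evaluation metrics from made-up confusion-matrix counts:

```python
# Hedged sketch: evaluation metrics from made-up confusion-matrix counts.
TP, FP, TN, FN = 40, 10, 35, 15
total = TP + FP + TN + FN

accuracy = (TP + TN) / total
error_rate = (FN + FP) / total
sensitivity = TP / (TP + FN)      # recall for the positive class
specificity = TN / (TN + FP)      # recall for the negative class
precision = TP / (TP + FP)        # positive predictive value (PPV)
npv = TN / (TN + FN)              # negative predictive value (NPV)
f1 = 2 * sensitivity * precision / (sensitivity + precision)

print(accuracy, error_rate, sensitivity, specificity, precision, npv, f1)
```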
Basics
independent variables (x)
1 dependent variable (y)
predict well-defined outcome
binary/categorical
Methods
k-NN
Decision Tree
Regression
Input
Attributes of observation
Independent Variable
Attributes of an observation used to predict Y
X
covariates
predictors
Dependent Variable
numerical/continuous
Y
represents outcome
Observation/Record
Unit of observation, defined by the available variables/attributes
Split data like for classification
validation data
training data
testing data
Regression Tree
Basics
Sum of Squared Deviations instead of Entropy
decision tree for numerical outcomes
data are split based on values of attributes
Average of the outcome values (for leaf nodes)
Leaf Nodes' Class
Determined by the avg. of the outcome values of the data points in that node
Take the Y values of all observation in the leaf node
Sum these values
Divide the sum by the # of observations in the leaf node
Split Rule
No upper limit
Split data so it minimizes SSD
Lowest possible value 0 if all the outcomes in the partition have the exact same value
lower SSD = more "agreement" in the outcome
Splits are made using the sum of squared deviations (SSD) from the average outcome value at that node
Calculate the avg. of the outcome values of all the data points in the node, aka μ (mu)
Take the Y value of each observation and subtract μ
Square the difference
Sum the squares
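A minimal Python sketch of SSD for the two partitions produced by a made-up candidate split:

```python
# Hedged sketch: sum of squared deviations (SSD) for the partitions of a candidate split.
def ssd(values):
    """Squared deviations from the partition's average outcome, summed."""
    mu = sum(values) / len(values)          # average outcome value in the node
    return sum((y - mu) ** 2 for y in values)

# Made-up outcome values landing in the two children of a candidate split
left_y = [10.0, 12.0, 11.0]
right_y = [30.0, 28.0, 35.0, 31.0]

total_ssd = ssd(left_y) + ssd(right_y)      # the tree picks the split minimizing this
print(total_ssd)
```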
Pros
robust to outliers
no parametric assumptions
good for variable selection
Cons
unstable - slight change in the data can lead to very different splits
splits are done 1 attribute at a time --> miss interesting relationships between attributes
Output
Given new data we follow the rules of the tree
Regression trees discretize the outcomes
aka putting the outcomes into different segments
predictions are less detailed than methods that output a continuous value
k-NN
Basics
might use a weighted avg, with weight decreasing with distance
need to decide k
for a given observation, find the k-nearest neighbors
neighbors = observations with shortest distance from new observation
numeric prediction = avg. of the nearest neighbor's outcomes
Pros
does not make assumptions
simple
effectively captures complex relationships without really building a model
Cons
Curse of dimensionality
lots of attributes = need lots of observations to have a good prediction
Slow
not good for "real time" predictions
especially bad with large dataset
Steps
pick k
pick the distance measure + should we normalize?
usually use standardization
if yes, normalize first, compute distance second
outcome variable NOT normalized
Compute the pairwise distance between new observation & all the rest
Pick the k-nearest neighbor (shortest distance)
Prediction = avg. of the Ys of the neighbors
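A minimal Python sketch of k-NN numeric prediction on made-up, already-normalized data, using a plain (unweighted) average:

```python
import math

# Hedged sketch: k-NN numeric prediction on made-up, already-normalized data.
train = [([0.1, 0.2], 100.0), ([0.2, 0.1], 110.0), ([0.8, 0.9], 300.0),
         ([0.9, 0.8], 320.0), ([0.7, 0.9], 290.0)]
new_x = [0.75, 0.85]
k = 3

# shortest pairwise distance to the new observation picks the neighbors
neighbors = sorted(train, key=lambda row: math.dist(row[0], new_x))[:k]

# prediction = plain average of the neighbors' Y values (a weighted avg. is also possible)
prediction = sum(y for _, y in neighbors) / k
print(prediction)
```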
Evaluation Performance
MAE = Mean Absolute Error
tells us the magnitude of the avg. error in any direction
direction of errors is lost (aka over/under-prediction)
absolute value of the error
sum them & take the avg.
negative numbers are transformed into positive
positive numbers stay positive
Prediction Error
difference between predicted outcome & actual outcome
predicted value - actual value
overprediction
predicted value > actual value
underprediction
predicted value < actual value
ME = Mean Error
avg. of prediction errors
not very informative
if positive
on avg. overpredicting
if negative
on avg, underpredicting
MAPE = Mean Absolute Percentage Error
Equation
absolute value of error
divide by actual value
sum them
take the avg
multiply by 100
No direction
relative to the actual values
gives % of how much predictions deviate from actual values
Total SSE = Total Sum of Squared Error
if a prediction error is already large, squaring it makes it even larger, increasing total SSE by a lot
penalizes larger errors
square the prediction errors and sum
RMSE = Root Mean Squared Error
Equation
square the prediction errors
sum them
take the avg.
square root the whole thing
measures the avg. magnitude of the error
penalizes larger errors
if a prediction error is already large, squaring it makes it even larger, increasing RMSE by a lot
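A short Python sketch computing ME, MAE, MAPE, total SSE, and RMSE from made-up actual and predicted values:

```python
# Hedged sketch: ME, MAE, MAPE, total SSE, and RMSE for made-up predictions.
actual = [100.0, 150.0, 200.0, 120.0]
predicted = [110.0, 140.0, 230.0, 115.0]

errors = [p - a for p, a in zip(predicted, actual)]   # prediction error = predicted - actual
n = len(errors)

me = sum(errors) / n                                              # mean error: sign shows over/under-prediction
mae = sum(abs(e) for e in errors) / n                             # mean absolute error: avg. magnitude
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / n  # % deviation from actual values
sse = sum(e ** 2 for e in errors)                                 # total sum of squared errors
rmse = (sse / n) ** 0.5                                           # root mean squared error

print(me, mae, mape, sse, rmse)
```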
Basics
Predicts continuous/numeric values of dependent variable