Data Analytics
Descriptive Analytics/Unsupervised Learning
Association Rule
Is the output good?
Lift (not directional)
Lift of (X --> Y)
Result
Lift = 1 = customers who buy X are as likely to buy Y as any other customer
Lift > 1 = customers who buy X are more likely to buy Y (complementary)
Lift < 1 = customers who buy X are less likely to buy Y (substitute)
S(X --> Y) / (S(X) * S(Y))
P(X, Y) / P(X)P(Y)
Could the rule be just by chance?
Pure Chance
90% of customers buy X
90% of customers buy Y
By pure chance .9 * .9 = 81% of transactions include itemset {X, Y}
Support (not directional)
Support Percentage
of a single itemset
of a rule
How frequent/common is the rule?
Support Count
of a single itemset
of a rule
Confidence (directional)
Confidence of Y --> X
How confident are we about the direction of the rule?
Confidence of X --> Y
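To make these metrics concrete, here is a minimal Python sketch (the transactions and item names are made up for illustration) showing one way support, confidence, and lift of a rule X --> Y could be computed:

```python
# Hedged sketch: support, confidence, and lift for a rule X --> Y
# from a small, made-up list of transactions.
transactions = [
    {"Coffee", "Bagel"},
    {"Coffee", "Bagel", "Milk"},
    {"Coffee"},
    {"Bagel"},
    {"Coffee", "Bagel"},
]

def support(itemset, transactions):
    """Support percentage: fraction of transactions containing the itemset."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(X, Y, transactions):
    """Confidence of X --> Y = S(X --> Y) / S(X); directional."""
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """Lift of X --> Y = S(X --> Y) / (S(X) * S(Y)); not directional."""
    return support(X | Y, transactions) / (support(X, transactions) * support(Y, transactions))

X, Y = {"Coffee"}, {"Bagel"}
print(support(X | Y, transactions))   # support of the rule
print(confidence(X, Y, transactions)) # confidence of Coffee --> Bagel
print(lift(X, Y, transactions))       # lift > 1 suggests complementary items
```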
Input
dataset/transaction
Itemset = subset of U
Instance of an itemset
Example: X = {Coffee, Bagel} = 2-item itemset
Any combo of existing items in U
U = universal set of items
Output
Association rules
If X then Y
Relationship between itemsets NOT items
X --> Y
One-directional
X & Y canNOT overlap
Process
Problem: with N items, there are 2^N potential itemsets
Apriori Algorithm
Steps
Check all 1-item itemsets & keep only frequent ones
Check all 2-item itemsets from prev. step, keep only frequent ones
Keep going until you checked frequent itemsets of all sizes
Generate all association rules
Find rules where confidence >= minconf
Only consider high support
Find all itemsets with support >= minsupp, a.k.a. frequent itemsets
If X is NOT frequent, then any larger itemset containing X is NOT frequent
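A rough Python sketch of the Apriori idea under these steps, using made-up transactions and a made-up minsupp value; it only finds the frequent itemsets, rule generation would follow:

```python
# Hedged sketch of the Apriori idea: grow candidate itemsets one size at a time,
# keeping only those whose support meets minsupp. Transactions are made up.
transactions = [
    {"Coffee", "Bagel"}, {"Coffee", "Bagel", "Milk"},
    {"Coffee"}, {"Bagel", "Milk"}, {"Coffee", "Bagel"},
]
minsupp = 0.4  # minimum support percentage

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Step 1: frequent 1-item itemsets
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= minsupp]
all_frequent = list(frequent)

# Steps 2..n: build size-k candidates from frequent (k-1)-itemsets, keep frequent ones
k = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= minsupp]
    all_frequent.extend(frequent)
    k += 1

print(all_frequent)  # every frequent itemset; rules are then generated from these
```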
Clustering
Algorithms
Hierarchical Clustering
Process
Prep
Distance between points
Distance matrix
Start with each data point as a 1-point cluster
Merge the 2 points with the shortest distance (most similar)
Merge 2 next-closest clusters
Repeat
Dendrogram
Reading It
Cut-off height determines the number of clusters
Higher cut-off = best 2-clusters solution
Lower cut-offs = best K-clusters solution for larger K
Repeat with different cut-offs to get solutions for any K
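A minimal sketch of hierarchical clustering and dendrogram cutting in Python, assuming scipy is available; the points and cut-offs are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hedged sketch: agglomerative (hierarchical) clustering on made-up 2-D points.
# linkage() builds the merge history behind the dendrogram; fcluster() "cuts" it.
points = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]])

Z = linkage(points, method="single")                 # merge closest clusters repeatedly
labels_k2 = fcluster(Z, t=2, criterion="maxclust")   # cut dendrogram into 2 clusters
labels_k3 = fcluster(Z, t=3, criterion="maxclust")   # same tree, different cut-off

print(labels_k2)  # best 2-cluster solution
print(labels_k3)  # best 3-cluster solution
```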
Application
Which one to choose?
Hierarchical
Cons
Computationally demanding
Pros
Data-driven
Solutions to any # of clusters
K-Means
Pros
Less computationally demanding
Cons
Poor initialization = bad results
Not good for irregular shapes, noisy data or clusters with different densities
Partitioning-based Clustering
Process
Decide the number of clusters (k)
Algorithm picks k data points at random as initial centroids
Assign every other data point to its closest centroid
Update each centroid based on the new clusters
Repeat the assignment step
Repeat the centroid-update step
Keep repeating until convergence
Application
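A minimal Python sketch of the k-means steps above, with made-up 2-D data and k = 2; a library implementation would normally be used instead:

```python
import random
import math

# Hedged sketch of the k-means process on made-up 2-D data.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
k = 2

def dist(a, b):
    return math.dist(a, b)  # Euclidean distance

random.seed(0)
centroids = random.sample(data, k)  # pick k data points at random as initial centroids

for _ in range(100):  # keep repeating until convergence (or a max number of passes)
    # assignment step: each point goes to its closest centroid
    clusters = [[] for _ in range(k)]
    for p in data:
        closest = min(range(k), key=lambda i: dist(p, centroids[i]))
        clusters[closest].append(p)
    # update step: recompute each centroid as the mean of its cluster
    new_centroids = [
        tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
        for i, c in enumerate(clusters)
    ]
    if new_centroids == centroids:  # convergence: assignments stop changing
        break
    centroids = new_centroids

print(centroids, clusters)
```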
Evaluating Clusters
Within Sum of Squared Errors (WSS)
Process
We have K clusters
Centroid for each cluster is m
For each point x in cluster C, the error = squared distance to its own cluster's centroid m
Sum for all data points in a cluster
Repeat for all clusters
Sum them all together
Interpreting
Used to compare different partitions for the same dataset
Lower WSS = Higher intra-similarity
Clusters are more cohesive within
Between Sum of Squared Errors (BSS)
Process
We have K clusters
Centroid for each cluster is m
Centroid for all points is m*
Squared distance between each cluster centroid (m) & the global centroid (m*)
Sum the squared distances, weighted by the # of points in each cluster
Interpreting
Higher BSS = Lower inter-similarity (clusters are better separated)
Used to compare different partitions for the same dataset
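A small Python sketch of how WSS and BSS could be computed for an already-partitioned, made-up dataset:

```python
import math

# Hedged sketch: WSS and BSS for a made-up, already-partitioned dataset.
clusters = {
    "A": [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)],
    "B": [(5.0, 5.0), (5.2, 4.8)],
}

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

all_points = [p for pts in clusters.values() for p in pts]
m_star = centroid(all_points)  # global centroid m*

# WSS: squared distance from each point to its own cluster centroid, summed over all clusters
wss = sum(
    math.dist(p, centroid(pts)) ** 2
    for pts in clusters.values()
    for p in pts
)

# BSS: squared distance from each cluster centroid to m*, weighted by cluster size
bss = sum(
    len(pts) * math.dist(centroid(pts), m_star) ** 2
    for pts in clusters.values()
)

print(f"WSS={wss:.3f} (lower = more cohesive), BSS={bss:.3f} (higher = better separated)")
```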
How to choose K?
No "right" answer BUT there are better answers
Eyeballing
Elbow Test
Basics
Input
dataset of n columns/features and m rows/records
Output
Clusters
What are good clusters?
high intra-similarity
low inter-similarity
Distance Metrics
Numerical Data
Max-coordinate
the max among the absolute differences
Euclidean
shortest distance between 2 points
Manhattan
sum of the absolute difference of their coordinates
Binary Data
Matching
number of mismatches divided by the total number of attributes (k)
range is always [0,1]
higher = more distant
Used for symmetric data
Jaccard
excludes matches where a = 0 and b = 0
Used for asymmetric data
Categorical Data
Matching
Taxonomy based approach
using industry-standard product hierarchy
Translation-based approach
Distance between 2 city names can be replaced by geographical distance
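A short Python sketch of the numerical and binary distance metrics above, on made-up records:

```python
import math

# Hedged sketch of the distance metrics above, on made-up records.
x = [1.0, 4.0, 2.0]
y = [3.0, 1.0, 2.0]

euclidean = math.dist(x, y)                          # straight-line distance
manhattan = sum(abs(a - b) for a, b in zip(x, y))    # sum of absolute differences
max_coord = max(abs(a - b) for a, b in zip(x, y))    # max absolute difference

# Binary records (1 = present, 0 = absent)
u = [1, 0, 1, 0, 0]
v = [1, 1, 0, 0, 0]

mismatches = sum(a != b for a, b in zip(u, v))
matching = mismatches / len(u)                       # symmetric binary data; range [0, 1]

# Jaccard: ignore positions where both values are 0
relevant = [(a, b) for a, b in zip(u, v) if not (a == 0 and b == 0)]
jaccard = sum(a != b for a, b in relevant) / len(relevant)   # asymmetric binary data

print(euclidean, manhattan, max_coord, matching, jaccard)
```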
Data Normalization
Min-Max
rescales all values between 0 and 1 using the min and max
(x - min) / (max - min)
Standardization/Z-score
transforms the data to have a mean of 0 and a standard deviation of 1
(x - sample mean) / (sample standard deviation)
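A minimal Python sketch of both normalizations on a made-up column of values:

```python
import statistics

# Hedged sketch: min-max rescaling and z-score standardization of a made-up column.
ages = [22, 25, 30, 41, 58]

lo, hi = min(ages), max(ages)
min_max = [(x - lo) / (hi - lo) for x in ages]   # rescaled to [0, 1]

mean = statistics.mean(ages)
sd = statistics.stdev(ages)                      # sample standard deviation
z_scores = [(x - mean) / sd for x in ages]       # mean 0, standard deviation 1

print(min_max)
print(z_scores)
```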
Predictive Analytics/Supervised Learning
Classification
Algorithms
k-NN/Nearest Neighbors
Choosing k
accuracy = 1 - error rate
lowest error rate in testing data
error rate = % of misclassified observations
Trade-offs
Overfitting
small k values
sensitive to noise
sensitive to outliers
Underfitting
large k values
miss local structure of data
Measure
Distance metrics (ex. Euclidean)
decide k = how many neighbors we consider?
Key Idea
identify the k "nearest"/most similar observations
nearest neighbors "vote" for their own class
majority = predicted class
Procedure
Pick K = how many neighbors we will consider
Pick distance measure + should we normalize?
Calculate the Euclidean distance between the new observation & all existing ones
Pick the k-nearest neighbors
Assign to majority class
Pros
Easy to implement/use
Does NOT require assumptions
Cons
"Lazy" classifier
No "real" model building
Bad for large datasets
Slow at prediction time - bad for real-time prediction
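A minimal Python sketch of the k-NN classification procedure, with made-up, already-normalized training data and k = 3:

```python
import math
from collections import Counter

# Hedged sketch of k-NN classification on made-up (already normalized) data.
train = [
    ([0.1, 0.2], "No"), ([0.2, 0.1], "No"), ([0.8, 0.9], "Yes"),
    ([0.9, 0.8], "Yes"), ([0.7, 0.9], "Yes"),
]
new_point = [0.75, 0.85]
k = 3

# Compute the distance from the new observation to every training observation
neighbors = sorted(train, key=lambda row: math.dist(row[0], new_point))[:k]

# The k nearest neighbors "vote"; the majority class is the prediction
votes = Counter(label for _, label in neighbors)
prediction = votes.most_common(1)[0][0]
print(prediction)  # "Yes" for this made-up example
```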
Decision Tree
Cons
Unstable: slight change in data = very different split
Splits are one attribute at a time = may miss interesting relationships between attributes
Pros
Robust to outliers
Good for variable selection
Do NOT require assumptions
Purity/Entropy
Binary data
Impure = when a leaf contains both true & false values of the outcome variable
lower entropy = purer
lower entropy = more observations in partition that have same value for outcome variable
weighted entropy = weighted avg of entropy for the leaves
Numeric data
Sort the rows by the numeric attribute (e.g., age)
Calculate the avg. value of each pair of adjacent rows = candidate thresholds
Calculate the weighted entropy of the split at each candidate threshold
Terminology
Leaf Node = final node
arrows to them
contains class label
Root Node = very top
arrows away from it
Internal Nodes/Branches/Decision Nodes = in betweens
arrows to them
arrows away from them
Building
Internal Nodes/Branches/Decision Nodes = in betweens
Construct all possible attribute thresholds
Pick the one with the largest information gain as the root node
Continue splitting based on information gain
The same attribute can be used again if splits do NOT overlap
Not all available attributes need to be used
Information Gain
= Entropy (parent node) - Weighted entropy (children nodes)
Largest = root
Can we reduce impurity by splitting the chosen root on other attributes? Repeat
Use majority rule to label the leaf nodes
Stopping Criteria
When all data points in a node are from the same class
There are no remaining attributes to split on
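A small Python sketch of entropy, weighted entropy, and information gain for a made-up candidate split on a binary outcome:

```python
import math
from collections import Counter

# Hedged sketch: entropy, weighted entropy, and information gain for a binary outcome.
def entropy(labels):
    """Entropy of a list of class labels; lower = purer."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy(parent node) minus the weighted avg. entropy of the child partitions."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Made-up split: a candidate threshold partitions 10 outcomes into two leaves
parent = ["Yes"] * 6 + ["No"] * 4
left, right = ["Yes"] * 5 + ["No"] * 1, ["Yes"] * 1 + ["No"] * 3

print(information_gain(parent, [left, right]))  # larger gain = better split
```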
Steps
Randomly split data into
testing data
training data
validation data
Model Construction/Training Step
Choose classification model
Train it on training data
End up with a trained model
Model Validation/Validation Step
Refine/fine-tune trained model on validation data
Sometimes skip this step
End up with a refined trained model
Model Testing/Testing Step
Assess how accurate model is
Use evaluation metrics
Receive testing performance
Apply model to new real data
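A minimal Python sketch of a random 60/20/20 split into training, validation, and testing data; the proportions and seed are arbitrary choices for illustration:

```python
import random

# Hedged sketch: a random 60/20/20 split into training, validation, and testing data.
records = list(range(100))  # stand-in for the rows of a real dataset

random.seed(42)
random.shuffle(records)

n = len(records)
train = records[: int(0.6 * n)]                    # used to train the model
validation = records[int(0.6 * n): int(0.8 * n)]   # used to refine/fine-tune the model
test = records[int(0.8 * n):]                      # used only to assess accuracy

print(len(train), len(validation), len(test))      # 60 20 20
```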
Evaluating
Unbalanced dataset
Accuracy is a bad measure
Some classes more important
Confusion Matrix
F1 - Score
Better than Accuracy for Unbalanced data
If we care about recall AND precision
= (2 x recall x precision) / (recall + precision)
Recall
Recall for negative class = Specificity = TN / (TN + FP)
For each ACTUAL class, how many were recovered?
Recall for positive class = Sensitivity = TP / (TP + FN)
Precision
For each PREDICTED class, how many did we get right?
Precision for positive class = Positive Predictive Value (PPV) = TP / (TP + FP)
Precision for negative class = Negative Predictive Value (NPV) = TN / (TN + FN)
Error Rate = (FN + FP) / total observations
Accuracy = (TP + TN) / total observations
Values
False Positive (FP): Predicted YES, Actual NO
True Positive (TP): Predicted YES, Actual YES
True Negative (TN): Predicted NO, Actual NO
False Negative (FN): Predicted NO, Actual YES
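A short Python sketch computing these evaluation metrics from made-up confusion-matrix counts:

```python
# Hedged sketch: evaluation metrics from made-up confusion-matrix counts.
TP, FP, TN, FN = 40, 10, 35, 15
total = TP + FP + TN + FN

accuracy = (TP + TN) / total
error_rate = (FN + FP) / total
sensitivity = TP / (TP + FN)      # recall for the positive class
specificity = TN / (TN + FP)      # recall for the negative class
precision = TP / (TP + FP)        # positive predictive value (PPV)
npv = TN / (TN + FN)              # negative predictive value (NPV)
f1 = 2 * sensitivity * precision / (sensitivity + precision)

print(accuracy, error_rate, sensitivity, specificity, precision, npv, f1)
```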
Basics
independent variables (x)
1 dependent variable (y)
predict well-defined outcome
binary/categorical
Methods
k-NN
Decision Tree
Regression
Input
Attributes of observation
Independent Variable
Attributes of an observation used to predict Y
X
covariates
predictors
Dependent Variable
numerical/continuous
Y
represents outcome
Observation/Record
Unit of observation, defined by the available variables/attributes
Split data like for classification
validation data
training data
testing data
Regression Tree
Basics
Sum of Squared Deviations instead of Entropy
decision tree for numerical outcomes
data are split based on values of attributes
Average of the outcome values (for leaf nodes)
Leaf Nodes' Class
Determined by the avg. of the outcome values of the data points in that node
Take the Y values of all observation in the leaf node
Sum these values
Divide the sum by the # of observations in the leaf node
Split Rule
No upper limit
Split data so it minimizes SSD
Lowest possible value 0 if all the outcomes in the partition have the exact same value
lower SSD = more "agreement" in the outcome
Splits are made using the sum of squared deviations (SSD) from the average outcome value at that node
Calculate the avg. of the outcome values of all the data points in the node, aka μ (mu)
Take the Y value of each observation and subtract μ
Square the difference
Sum the squares
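A minimal Python sketch of SSD for the two partitions produced by a made-up candidate split:

```python
# Hedged sketch: sum of squared deviations (SSD) for the partitions of a candidate split.
def ssd(values):
    """Squared deviations from the partition's average outcome, summed."""
    mu = sum(values) / len(values)          # average outcome value in the node
    return sum((y - mu) ** 2 for y in values)

# Made-up outcome values landing in the two children of a candidate split
left_y = [10.0, 12.0, 11.0]
right_y = [30.0, 28.0, 35.0, 31.0]

total_ssd = ssd(left_y) + ssd(right_y)      # the tree picks the split minimizing this
print(total_ssd)
```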
Pros
robust to outliers
no parametric assumptions
good for variable selection
Cons
unstable - slight change in the data can lead to very different splits
splits are done 1 attribute at a time --> miss interesting relationships between attributes
Output
Given new data we follow the rules of the tree
Regression trees discretize the outcomes
aka putting the outcomes into different segments
predictions are less detailed than methods that output a continuous value
k-NN
Basics
might use a weighted avg, with weight decreasing with distance
need to decide k
for a given observation, find the k-nearest neighbors
neighbors = observations with shortest distance from new observation
numeric prediction = avg. of the nearest neighbor's outcomes
Pros
does not make assumptions
simple
effectively captures complex relationships without really building a model
Cons
Curse of dimensionality
lots of attributes = need lots of observations to have a good prediction
Slow
not good for "real time" predictions
especially bad with large dataset
Steps
pick k
pick the distance measure + should we normalize?
usually use standardization
if yes, normalize first, compute distance second
outcome variable NOT normalized
Compute the pairwise distance between new observation & all the rest
Pick the k-nearest neighbor (shortest distance)
Prediction = avg. of the Ys of the neighbors
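A minimal Python sketch of k-NN numeric prediction on made-up, already-normalized data, using a plain (unweighted) average:

```python
import math

# Hedged sketch: k-NN numeric prediction on made-up, already-normalized data.
train = [([0.1, 0.2], 100.0), ([0.2, 0.1], 110.0), ([0.8, 0.9], 300.0),
         ([0.9, 0.8], 320.0), ([0.7, 0.9], 290.0)]
new_x = [0.75, 0.85]
k = 3

# shortest pairwise distance to the new observation picks the neighbors
neighbors = sorted(train, key=lambda row: math.dist(row[0], new_x))[:k]

# prediction = plain average of the neighbors' Y values (a weighted avg. is also possible)
prediction = sum(y for _, y in neighbors) / k
print(prediction)
```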
Evaluation Performance
MAE = Mean Absolute Error
tells us the magnitude of the avg. error in any direction
direction of errors is lost (aka over/under-prediction)
absolute value of the error
sum them & take the avg.
negative numbers are transformed into positive
positive numbers stay positive
Prediction Error
difference between predicted outcome & actual outcome
predicted value - actual value
overprediction
predicted value > actual value
underprediction
predicted value < actual value
ME = Mean Error
avg. of prediction errors
not very informative
if positive
on avg. overpredicting
if negative
on avg, underpredicting
MAPE = Mean Absolute Percentage Error
Equation
absolute value of error
divide by actual value
sum them
take the avg
multiply by 100
No direction
relative to the actual values
gives % of how much predictions deviate from actual values
Total SSE = Total Sum of Squared Error
if a prediction error is already large, squaring it makes it even larger, increasing total SSE by a lot
penalizes larger errors
square the prediction errors and sum
RMSE = Root Mean Squared Error
Equation
square the prediction errors
sum them
take the avg.
square root the whole thing
measures the avg. magnitude of the error
penalizes larger errors
if a prediction error is already large, squaring it makes it even larger, increasing RMSE by a lot
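A short Python sketch computing ME, MAE, MAPE, total SSE, and RMSE from made-up actual and predicted values:

```python
# Hedged sketch: ME, MAE, MAPE, total SSE, and RMSE for made-up predictions.
actual = [100.0, 150.0, 200.0, 120.0]
predicted = [110.0, 140.0, 230.0, 115.0]

errors = [p - a for p, a in zip(predicted, actual)]   # prediction error = predicted - actual
n = len(errors)

me = sum(errors) / n                                              # mean error: sign shows over/under-prediction
mae = sum(abs(e) for e in errors) / n                             # mean absolute error: avg. magnitude
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / n  # % deviation from actual values
sse = sum(e ** 2 for e in errors)                                 # total sum of squared errors
rmse = (sse / n) ** 0.5                                           # root mean squared error

print(me, mae, mape, sse, rmse)
```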
Basics
Predicts continuous/numeric values of dependent variable