Descriptive + Predictive ML
Association rules
Unsupervised, Descriptive
INPUT: transactions of items
itemset (e.g., X = {coffee, bagel} is a 2-item itemset)
OUTPUT: association rules
Evaluation equations
1) Support: not directional
Support count: #(X) = number of transactions containing itemset X
Support %: #(X) / total # of transactions
2) Confidence: strength of direction
confidence(X → Y) = #(X,Y) / #(X)
3) Lift: corrects for chance (not directional)
lift(X → Y) = confidence(X → Y) / support%(Y)
Lift > 1 = a customer who bought X is more likely than average to buy Y
4) Apriori algorithm: finding frequent itemsets
Pruning: if itemset X is not frequent, then no itemset containing X can be frequent either
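A minimal Python sketch of the support / confidence / lift calculations above, on a made-up toy transaction list (the item names and the rule {coffee} → {bagel} are illustrative, not from the diagram):

```python
# Toy transactions; each transaction is a set of items.
transactions = [
    {"coffee", "bagel"},
    {"coffee", "bagel", "milk"},
    {"coffee"},
    {"bagel", "milk"},
    {"coffee", "bagel"},
]

def support(itemset):
    # support % = fraction of transactions containing every item in itemset
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

X, Y = {"coffee"}, {"bagel"}
sup_xy = support(X | Y)              # support of X and Y together
confidence = sup_xy / support(X)     # #(X,Y) / #(X), as a fraction
lift = confidence / support(Y)       # > 1: X buyers buy Y more than average

print(f"support={sup_xy:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```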
Classification
Supervised Predictive
Step 1: Split + train
Randomly split the data into 3 sets: training, validation, test
Choose a classification model + train it on the training data
Step 2: Model validation
Refine the trained model using the validation data
Step 3: Model testing
Measure accuracy on the test data
If accuracy is good, use the model to classify new data
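A sketch of the 3-way split with scikit-learn's train_test_split; the toy data and the 60/20/20 proportions are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=42)  # toy stand-in data

X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42)            # 60% training
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)  # 20% validation + 20% test
```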
KNN
For a given new observation, identify the nearest training observations
Distance metric: often Euclidean
Decide K (# of neighbors)
Determine normalization (AFTER the split: fit the scaling on training data only, to avoid leakage)
Compute the distances
Pick the K nearest neighbors
Choosing K
K too small: overfitting - sensitive to outliers and noise
K too large: underfitting - may miss the local structure of the data
Choose the K with the lowest validation error rate
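A sketch of KNN classification with normalization fitted after the split; the toy data and K = 5 are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler().fit(X_train)       # fit scaling on training data only
knn = KNeighborsClassifier(n_neighbors=5)  # K = 5, Euclidean by default
knn.fit(scaler.transform(X_train), y_train)
print(knn.score(scaler.transform(X_test), y_test))  # test accuracy
```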
Decision Tree
Handles numeric + binary (categorical) data
Start at the top: ROOT node
Work down: internal (decision) nodes
Bottom: LEAF node (contains the class label)
Unbalanced data: use a confusion matrix
Error rate: (FN + FP) / (TP + TN + FP + FN)
Precision: TP / (TP + FP), read horizontally along the predicted-positive row
Accuracy: (TP + TN) / (TP + TN + FP + FN)
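A sketch of these metrics from raw confusion-matrix counts; the counts below are made up to mimic an unbalanced problem:

```python
TP, FP, FN, TN = 20, 5, 10, 965   # illustrative unbalanced counts

accuracy   = (TP + TN) / (TP + TN + FP + FN)
error_rate = (FN + FP) / (TP + TN + FP + FN)
precision  = TP / (TP + FP)       # of predicted positives, how many are right

print(f"accuracy={accuracy:.3f} error={error_rate:.3f} precision={precision:.3f}")
```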
Clustering algorithms
Hierarchical
Agglomerative: form large clusters by merging small ones
Start by merging the closest individual points
Dendrogram
Length of the linking lines is proportional to the distance at which clusters merge
Cut-off: the max distance we want between 2 clusters
Partition clustering (K-means)
Partition the data into K groups
1) Algorithm chooses K points at random as initial centroids
2) Assign every other data point to the centroid it is closest to
3) Recompute each cluster's centroid, then repeat step 2 to reassign points
Algorithm stops when assignments stabilize, i.e., within-cluster variance is minimized
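A sketch of the K-means loop above via scikit-learn; toy blob data and K = 3 are illustrative choices:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)  # toy data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])   # cluster assignment per point
print(km.inertia_)       # WSS: within-cluster sum of squared errors
```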
Preprocessing
deal with missing values
remove outliers
normalize
Evaluating clusters
BSS
Sum the squared distances between each cluster's centroid and the global centroid
High BSS = low intersimilarity (well-separated clusters)
WSS
Add the squared errors (squared distances to the cluster centroid) of all data points in a cluster
Repeat for all clusters + add the results
Low WSS = high intrasimilarity (tight clusters)
Choosing K
Eyeballing: look at the dendrogram for a natural cut-off
Elbow point: plot WSS against K and pick the K where the curve bends
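A sketch of the elbow method, reusing KMeans inertia_ (scikit-learn's name for WSS); the toy data and the K range are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
ks = range(1, 10)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, wss, marker="o")     # WSS drops fast, then flattens
plt.xlabel("K"); plt.ylabel("WSS")
plt.show()                        # the bend (elbow) suggests K
```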
Clustering
Unsupervised, Descriptive
Input
Observations: Rows
Dimensions: Columns
Output: Groups
Goal: high intrasimilarity + low intersimilarity
Evaluating output
Distance metrics
Numerical
Euclidean - straight-line (shortest) distance between two points
Manhattan - sum of the absolute differences between the 2 points' coordinates
Max coordinate - the largest absolute difference across all coordinates
Binary
Matching distance - # of mismatching attributes / # of total attributes
Jaccard - like matching distance, but excludes 0-0 matches (attributes absent from both)
Categorical
Matching approach - (K - M) / K, where K = # of attributes and M = # of matches
Taxonomy approach - use an industry-standard product hierarchy
Translation approach - e.g., distances between city names converted to numbers
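A sketch of the numeric and binary distance metrics above in plain numpy; the vectors are made up:

```python
import numpy as np

a, b = np.array([1.0, 4.0, 0.0]), np.array([3.0, 1.0, 2.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences
max_coord = np.max(np.abs(a - b))           # largest single-coordinate gap

u, v = np.array([1, 0, 1, 0]), np.array([1, 1, 0, 0])  # binary attributes
matching = np.mean(u != v)                  # mismatches / total attributes
both_zero = (u == 0) & (v == 0)
jaccard = np.sum(u != v) / np.sum(~both_zero)  # ignore 0-0 matches

print(euclidean, manhattan, max_coord, matching, jaccard)
```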
Data normalization
Min-max: (x - min) / (max - min)
Standardization (z-score): (x - mean) / stdev
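A sketch of both normalizations on a toy numpy column:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])             # made-up values
min_max = (x - x.min()) / (x.max() - x.min())   # rescales to [0, 1]
z_score = (x - x.mean()) / x.std()              # mean 0, stdev 1

print(min_max, z_score)
```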
Regression
Supervised Predictive
Input
Observations
X - predictors
Y - outcome
Steps
training, validation, test
1) model construction (training data)
2) model testing (test data): measure accuracy
3) new data prediction
Regression Trees
A decision tree that predicts a numeric outcome from decision rules
Split criterion: SSD (sum of squared deviations) at a given node, NOT entropy
Final (leaf) node prediction = average of the outcome values in that node
Split rule: pick the split point with the lowest possible total SSD
SSD = 0 means the node is pure
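A sketch of how one candidate split is scored by total SSD; the predictor, outcomes, and split points are made up:

```python
import numpy as np

def ssd(y):
    # sum of squared deviations from the node's mean; 0 for an empty side
    return np.sum((y - y.mean()) ** 2) if len(y) else 0.0

x = np.array([1, 2, 3, 4, 5, 6])                  # one predictor
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.0])   # numeric outcome

for threshold in (2.5, 3.5, 4.5):                 # candidate split points
    left, right = y[x <= threshold], y[x > threshold]
    print(threshold, ssd(left) + ssd(right))      # pick the lowest total SSD
```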
KNN for regression
Identify the K nearest neighbors
Decide K by minimizing validation RMSE
New prediction = average of the nearest neighbors' outcome values
Can use a weighted average (e.g., weighting closer neighbors more heavily)
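A sketch of KNN regression with scikit-learn; the toy data, K = 5, and distance weighting are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=200, n_features=3, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsRegressor(n_neighbors=5, weights="distance")  # weighted avg
knn.fit(X_train, y_train)
print(knn.predict(X_test[:3]))  # avg of each point's 5 nearest outcome values
```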
Evaluating performance
Prediction error = predicted value - actual value
Predicted > actual = overprediction
Predicted < actual = underprediction
Mean error (ME)
Positive = overprediction on average
Negative = underprediction on average
(opposite-signed errors can cancel out, so ME can look small)
Mean absolute error (MAE)
Average magnitude of error in either direction
RMSE
Square root of the average squared error: the typical magnitude of error
Total sum of squared errors (SSE)
Sum of the squared prediction errors
MAPE: mean absolute % error
Average of |error| / actual, as a %: deviation of predictions relative to actual values
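A sketch of all five error metrics in numpy; the predicted/actual values are made up:

```python
import numpy as np

actual    = np.array([100.0, 150.0, 200.0, 120.0])
predicted = np.array([110.0, 140.0, 210.0, 115.0])
errors = predicted - actual

me   = errors.mean()                      # sign shows over/underprediction
mae  = np.abs(errors).mean()              # magnitude in either direction
rmse = np.sqrt((errors ** 2).mean())      # typical error magnitude
sse  = (errors ** 2).sum()                # total sum of squared errors
mape = (np.abs(errors) / actual).mean() * 100  # % deviation

print(me, mae, rmse, sse, mape)
```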