Chapter 3: Predictive Modeling
Supervised Segmentation
Have a target quantity
Finding or selecting important informative variables
How can we segment the population with respect to something we would like to predict or estimate? (groups that differ from each other)
"Which accounts have been defrauded?"
"Which customers are likely to not pay off their account balances?"
Classification Tree Induction
Predictive Model
A formula for estimating the unknown value of interest: the target
Prediction
To estimate an unknown value
Descriptive Model
To gain insight into the underlying phenomenon or process
Judged primarily on its intelligibility (a predictive model, by contrast, is judged solely on its predictive performance)
Value comes from the understanding gained by looking at the model rather than from its predictions
Vocabulary
Supervised Learning
Model describes a relationship between a set of selected variables & a predefined variable (target)
Estimates value of target variable as a function of the features
Instance / Example
Represents a fact or a data point
Described by a set of attributes (fields, columns, variables or features)
Also called: Row, Feature Vector, or Case
Dataset
Also called: Table or worksheet
Contains a set of examples or instances
Features
Table columns
Also called: Independent Variables, Predictors, or Explanatory Variables
Target Variable
Values that are to be predicted
Also called: Dependent Variable
Model Induction
Creation of models from data
Induction Algorithm or Learner
The procedure that creates the model from the data
Have variants that induce models both for classification and for regression
Deduction
Contrasted from Induction
Starts with general rules and specific facts, and derives other specific facts from them
Training Data
Input data for induction algorithm
Also called: Labeled data
Value for the target variable (label) is known
Try to segment the population into subgroups with different values for the target variable
How can we judge whether a variable contains important information about the target variable?
Selecting Informative Attributes
"head-shape: square, circular" "body-shape: rectangular, oval"
"write-off: yes, no"
Distinguish write-offs from non write-offs
Pure
Homogeneous with respect to the target variable
If every member of a group has the same value for the target, then the group is pure
Make segments as pure as possible
Attributes rarely split a group perfectly
Not all attributes are binary
Many have 3 or more distinct values
Some attributes are numeric values
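The purity idea above can be sketched in a few lines of Python; the segments and write-off labels below are invented for illustration:

```python
def is_pure(labels):
    """A segment is pure if every instance shares the same target value."""
    return len(set(labels)) <= 1

# Hypothetical segments of accounts, labeled by write-off status
segment_a = ["no", "no", "no"]    # homogeneous -> pure
segment_b = ["no", "yes", "no"]   # mixed -> impure

print(is_pure(segment_a))  # True
print(is_pure(segment_b))  # False
```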
Purity Measure
Entropy
Measure of disorder that can be applied to a set
Information Gain
Based on Entropy
Splitting criterion
Properties
Disorder corresponds to how mixed (impure) the segment is with respect to the properties of interest
Write-Offs
Entropy = -p1 log2(p1) - p2 log2(p2) - ...
pi = probability (relative percentage) of property i within the set
pi = 1 when all members of the set have property i (max); pi = 0 when no members of the set have property i (min)
Entropy is 0 for a pure set and maximal when the properties are evenly mixed
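The entropy formula can be computed directly; a minimal Python sketch (the label lists are invented):

```python
import math

def entropy(labels):
    """entropy = -sum_i p_i * log2(p_i); a pure set scores 0."""
    n = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n  # relative frequency of this property
        result -= p * math.log2(p)
    return result

print(entropy(["yes"] * 4))               # pure set -> 0.0
print(entropy(["yes", "no", "yes", "no"]))  # evenly mixed binary set -> 1.0
```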
An attribute segments a set of instances into several subsets
How impure one individual subset is
Measures how much an attribute improves (decreases entropy)
Measures the change in entropy due to any amount of new information being added
IG(parent, children) = entropy(parent) - [p(c1) × entropy(c1) + p(c2) × entropy(c2) + ...]
Used to produce a multiple attribute supervised segmentation
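A sketch of the information-gain calculation, weighting each child's entropy by the fraction of parent instances it receives; the parent/children split is a hypothetical example:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(parent, children):
    """IG = entropy(parent) - sum_i p(c_i) * entropy(c_i)."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# Hypothetical: 10 accounts split by a binary attribute into two subsets
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"],   # mostly write-offs
            ["no"] * 4 + ["yes"]]   # mostly non-write-offs
print(information_gain(parent, children))  # ~0.278: the split reduces disorder
```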
Interior Node
Test of an attribute
Branch from the Node
Distinct value or range of values
Terminal Node
Leaf
Leaves correspond to segments & contain instances that tend to belong to the same class
Contains a value for the target variable
Each leaf contains a classification for its segment
Decision Nodes
Non-leaf (interior) nodes
Each tests an attribute to decide which branch to follow; class predictions are given at the leaves
Root Node
Start here & descend through interior nodes
End at a leaf: non-write-off or write-off
Divide & conquer approach
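The divide & conquer step, choosing the attribute with the highest information gain at a node, can be sketched as follows; the head-shape/body-shape data echoes the example attributes in the text, but the values are invented:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def best_attribute(instances, labels, attributes):
    """Greedy tree-induction step: pick the attribute whose split
    yields the highest information gain."""
    n = len(labels)
    best, best_gain = None, -1.0
    for a in attributes:
        groups = {}  # attribute value -> labels of instances with that value
        for inst, y in zip(instances, labels):
            groups.setdefault(inst[a], []).append(y)
        gain = entropy(labels) - sum(len(g) / n * entropy(g)
                                     for g in groups.values())
        if gain > best_gain:
            best, best_gain = a, gain
    return best

# Toy data: head-shape and body-shape attributes, write-off target
people = [
    {"head": "square",   "body": "rectangular"},
    {"head": "square",   "body": "oval"},
    {"head": "circular", "body": "rectangular"},
    {"head": "circular", "body": "oval"},
]
writeoff = ["yes", "yes", "no", "no"]
print(best_attribute(people, writeoff, ["head", "body"]))  # head splits perfectly
```

Repeating this choice on each resulting subset (stopping when a segment is pure) is exactly the recursive divide & conquer induction the notes describe.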
Instance Space
Space described by the data features
Instance Space Visualization
Scatterplot to detect correlations & relationships
Can only visualize segmentations in 2 or 3 dimensions at once
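A minimal scatterplot sketch of a 2-D instance space, coloring points by target value; assumes matplotlib is available, and the feature values are invented:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Each instance is a point in the space described by two features
age     = [25, 40, 35, 50, 23, 60]
balance = [1200, 300, 2500, 150, 900, 50]
target  = ["no", "yes", "no", "yes", "no", "yes"]  # write-off?

colors = ["tab:blue" if t == "no" else "tab:red" for t in target]
fig, ax = plt.subplots()
ax.scatter(age, balance, c=colors)
ax.set_xlabel("age")
ax.set_ylabel("balance")
fig.savefig("instance_space.png")
```

With more than 2 or 3 features, only a projection of the instance space can be drawn at once, which is the limitation noted above.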