Data Science for Business
By: Fawcett & Provost
CH 3: "Introduction to Predictive Modeling: From Correlation to Supervised Segmentation"
Models, Induction, and Prediction
Prediction: to forecast
a future event
Induction: refers to generalizing
from specific cases to
general rules
Model: a simplified representation of reality created to serve a purpose
In data science, a predictive model is a formula for estimating the unknown value of interest: the target
Characteristics: mathematical, logical statement,
or a hybrid of the two
In data science, prediction generally means
to estimate an unknown value
Characteristics: attributes,
feature vector,
instances
Deduction, by contrast: starts w/ general rules
and specific facts, and creates other
specific facts from them
Characteristics of induction: uses training data (labeled data),
induces models for both classification and
regression
Supervised Segmentation
Segment population into
subgroups
Fundamental Concept: How can we judge
whether a variable contains important info.
about the target variable?
Select informative
variables through direct,
multivariate supervised
segmentation
Selecting Informative Attributes
Examine attributes
and target variable
We want pure, or homogeneous,
resulting groups w/ respect to
the target variable
Complications:
Attributes rarely split
a group perfectly
Some attributes take on
numeric values (continuous
or integer)
Not all attributes
are binary
Common Splitting
Criterion
Information Gain
Based on purity measure,
entropy
Entropy: a measure of disorder
applied to a set
Ex) a set of write-offs and non-write-offs
Measure how informative
an attribute is
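The relationship between entropy and information gain can be sketched in a few lines. This is a minimal illustration, assuming entropy is measured in bits and information gain is the parent's entropy minus the size-weighted entropy of the child groups. The function names and the toy write-off data are hypothetical, not taken from the chapter.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy set: 5 write-offs and 5 non-write-offs (maximally impure).
parent = ["write-off"] * 5 + ["non-write-off"] * 5

# A hypothetical attribute splits the set into two fairly pure groups.
left = ["write-off"] * 4 + ["non-write-off"] * 1
right = ["write-off"] * 1 + ["non-write-off"] * 4

parent_entropy = entropy(parent)
gain = information_gain(parent, [left, right])

print(parent_entropy)   # 1.0
print(round(gain, 3))   # 0.278
```

A perfectly pure split would recover the full 1.0 bit of parent entropy. The split above recovers only part of it because each child group is still mixed.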
Fundamental Concept:
a natural measure of
impurity for numeric target values
(regression) is variance
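For a numeric target, the same "before vs. after the split" comparison can be made with variance in place of entropy. The sketch below assumes population variance and a size-weighted average over the child groups. The function names and the numbers are hypothetical.

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_reduction(parent, children):
    """Parent variance minus the size-weighted variance of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * variance(child) for child in children)
    return variance(parent) - weighted


# Hypothetical numeric target values for six instances.
parent = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# A hypothetical attribute separates the low values from the high values.
low, high = parent[:3], parent[3:]

reduction = variance_reduction(parent, [low, high])
print(round(reduction, 2))  # 20.25
```

A larger reduction means the split produces groups whose target values are more alike, which is the numeric analogue of purer groups.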
Ex) Attribute selection
w/ Information Gain
Calculate info. gain and
illustrate entropy (w/ shading)
GOAL: as little shaded area as possible = lowest entropy
Supervised Segmentation
w/ Tree-Structured Models
Tree is upside down
w/ root at the top
Tree is made up of
nodes (interior and exterior)
Interior nodes
contain attributes
Each path ends at
a terminal node, or leaf
Tree is supervised b/c
each leaf contains a
classification for its segment
Classification tree, or more
loosely, decision tree
Most data mining packages
include some type of tree
induction technique
"Divide and conquer"
approach
GOAL: at each step, select the
attribute that best splits the
current group into purer subgroups
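The divide-and-conquer idea can be sketched as a short recursive procedure. This is a simplified illustration, not the algorithm of any particular data mining package. It assumes categorical attributes only, uses information gain as the splitting criterion, and stops when a segment is pure, when no attributes remain, or when no attribute gives a positive gain. All names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def split(rows, attribute):
    """Group rows by the value they take on the given attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups


def info_gain(rows, attribute, target):
    """Entropy of the target before the split minus weighted entropy after."""
    parent = entropy([r[target] for r in rows])
    groups = split(rows, attribute).values()
    weighted = sum(len(g) / len(rows) * entropy([r[target] for r in g])
                   for g in groups)
    return parent - weighted


def build_tree(rows, attributes, target):
    """Return a leaf label, or {attribute: {value: subtree}} for a split."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the segment is pure or there is nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    if info_gain(rows, best, target) <= 0:
        return majority
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree(group, remaining, target)
                   for value, group in split(rows, best).items()}}


def classify(tree, row):
    """Follow the path from the root to a leaf for one instance."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][row[attribute]]
    return tree


# Hypothetical training data (labeled instances).
rows = [
    {"employed": "yes", "balance": "high", "write_off": "no"},
    {"employed": "no", "balance": "high", "write_off": "no"},
    {"employed": "yes", "balance": "high", "write_off": "no"},
    {"employed": "yes", "balance": "low", "write_off": "no"},
    {"employed": "no", "balance": "low", "write_off": "yes"},
    {"employed": "no", "balance": "low", "write_off": "yes"},
    {"employed": "no", "balance": "high", "write_off": "no"},
]

tree = build_tree(rows, ["employed", "balance"], "write_off")
print(tree)
prediction = classify(tree, {"employed": "no", "balance": "low"})
print(prediction)  # yes
```

On this toy data the root split is on balance, because it yields a higher information gain than employed. Only the impure "low" segment is split further. Note that classify raises a KeyError for an attribute value that never appeared in training; a fuller implementation would need a fallback.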
Visualizing Segmentations
Only possible to visualize
segmentations in 2 or 3
dimensions at once
Useful in understanding
the different types of models
Sets of Rules
Tree gathers common rule
prefixes together toward
the top of the tree
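The correspondence between a tree and a rule set can be shown by tracing every root-to-leaf path and writing it as one rule. The sketch below assumes a tree stored as nested dictionaries of the form {attribute: {value: subtree}}. That representation, the function name, and the example tree are hypothetical.

```python
def tree_to_rules(tree, conditions=()):
    """Return one (conditions, prediction) pair per root-to-leaf path."""
    if not isinstance(tree, dict):
        # A leaf: the conditions collected so far form a complete rule.
        return [(list(conditions), tree)]
    attribute = next(iter(tree))
    rules = []
    for value, subtree in tree[attribute].items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


# Hypothetical classification tree as nested dictionaries.
tree = {
    "balance": {
        "high": "no write-off",
        "low": {"employed": {"yes": "no write-off", "no": "write-off"}},
    }
}

rules = tree_to_rules(tree)
for conditions, prediction in rules:
    clause = " AND ".join(f"{attr} = {val}" for attr, val in conditions)
    print(f"IF {clause} THEN {prediction}")
# IF balance = high THEN no write-off
# IF balance = low AND employed = yes THEN no write-off
# IF balance = low AND employed = no THEN write-off
```

The last two rules share the prefix "balance = low", which is the condition the tree stores once, near its root.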
As the model grows, some people
prefer the tree and others the rule set
Probability Estimation
Results in more sophisticated
decision-making processes
Could lead to "overfitting" (a fundamental
issue) when a leaf has few instances
Laplace correction
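The effect of the Laplace correction on small leaves can be illustrated with the binary-class form, p = (n + 1) / (n + m + 2), where n is the number of instances in the leaf belonging to the class of interest and m is the number that do not. The function names and the counts below are hypothetical.

```python
def frequency_estimate(n, m):
    """Raw frequency-based probability estimate for a leaf."""
    return n / (n + m)


def laplace_estimate(n, m):
    """Binary-class Laplace-corrected estimate: (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)


# A small leaf: 2 positive instances, 0 negative.
print(frequency_estimate(2, 0))  # 1.0
print(laplace_estimate(2, 0))    # 0.75

# A larger leaf with the same raw frequency: 20 positive, 0 negative.
print(frequency_estimate(20, 0))           # 1.0
print(round(laplace_estimate(20, 0), 3))   # 0.955
```

Both leaves have a raw frequency of 1.0, but the corrected estimate stays more cautious for the leaf with only two instances and approaches the raw frequency as the leaf gets larger.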