Data Science for Business
By: Provost & Fawcett


CH 3: "Introduction to Predictive Modeling: From Correlation to Supervised Segmentation"

Models, Induction, and Prediction

Prediction: to forecast
a future event

Induction: refers to generalizing
from specific cases to
general rules

Model: a simplified representation of reality created to serve a purpose

In data science, a predictive model is a formula for estimating the unknown value of interest: the target

Characteristics: mathematical, logical statement,
or a hybrid of the two

In data science, prediction more generally
means to estimate an unknown value

Characteristics: attributes,
feature vector,
instances

Deduction, by contrast: starts w/ general rules
and specific facts, and creates other
specific facts from them

Characteristics: training data (labeled data),
induces models for both classification and
for regression

Supervised Segmentation

Segment population into
subgroups

Fundamental Concept: How can we judge
whether a variable contains important info.
about the target variable?

Select informative
variables through direct,
multivariate supervised
segmentation

Selecting Informative Attributes

Examine attributes
and target variable

We want pure, or homogeneous,
resulting groups w/ respect to
the target variable

Complications:

Attributes rarely split
a group perfectly

Some attributes take on
numeric values (continuous
or integer)

Not all attributes
are binary
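
A minimal sketch of one common way to handle a numeric attribute: try split points between adjacent values and keep the one that gives the purest groups. Assumptions not in the notes: purity is scored w/ entropy-based information gain (see Common Splitting Criterion), candidate thresholds are midpoints, and the names best_threshold, ages and write_off are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information gain) for the best binary split."""
    parent = entropy(labels)
    distinct = sorted(set(values))
    best_t, best_gain = None, 0.0
    # Candidate split points: midpoints between adjacent distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        if parent - weighted > best_gain:
            best_t, best_gain = t, parent - weighted
    return best_t, best_gain

# Hypothetical data: age (numeric attribute) vs. write-off (target)
ages = [22, 25, 30, 41, 47, 52]
write_off = ["yes", "yes", "yes", "no", "no", "no"]
print(best_threshold(ages, write_off))
```

In this toy data the split at 35.5 separates the two classes perfectly. An attribute w/ more than two values is scored the same way, w/ one child group per value.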

Common Splitting
Criterion

Information Gain

Based on purity measure,
entropy

Entropy: a measure of disorder
applied to a set

Write-offs and non-write-offs
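
A minimal sketch of entropy for a set of class labels, assuming the usual definition entropy = -sum(p_i * log2(p_i)) over the class proportions p_i. The function name and the 7/3 example set are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a collection of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    # Sum -p * log2(p) over the proportion p of each class present
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

# Hypothetical sets of customers
mixed = ["no write-off"] * 7 + ["write-off"] * 3   # impure
pure = ["no write-off"] * 10                       # perfectly pure
even = ["no write-off"] * 5 + ["write-off"] * 5    # maximally mixed

for name, group in [("mixed", mixed), ("pure", pure), ("even", even)]:
    print(name, round(entropy(group), 2))
```

W/ two classes, entropy runs from 0 (pure set) to 1 (50/50 mix); the 7/3 set lands at about 0.88.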

Measure how informative
an attribute is
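
A sketch of information gain, assuming the standard form: entropy of the parent set minus the size-weighted average entropy of the child sets produced by splitting on the attribute. The 30-instance split below is hypothetical data, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical split of 30 instances into two child groups
parent = ["write-off"] * 16 + ["no"] * 14
children = [
    ["write-off"] * 12 + ["no"] * 1,
    ["write-off"] * 4 + ["no"] * 13,
]
print(round(information_gain(parent, children), 2))
```

A split whose children have the same class mix as the parent has a gain of 0: the attribute tells us nothing about the target.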

Fundamental Concept:
a natural measure of
impurity for numeric values
is variance
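
For a numeric target, the same idea can be sketched w/ variance in place of entropy: a good split is one whose child groups have low weighted variance. This is an illustration under assumptions (population variance, made-up numbers, hypothetical function name), not a method spelled out in these notes.

```python
from statistics import pvariance

def variance_reduction(parent, children):
    """Parent variance minus the size-weighted variance of the children."""
    total = len(parent)
    weighted = sum(len(c) / total * pvariance(c) for c in children)
    return pvariance(parent) - weighted

# Hypothetical numeric target values (e.g. amount spent), split in two
parent = [10, 12, 11, 30, 32, 31]
good_split = [[10, 12, 11], [30, 32, 31]]   # similar values grouped
poor_split = [[10, 30, 11], [12, 32, 31]]   # values mixed together

print(variance_reduction(parent, good_split))
print(variance_reduction(parent, poor_split))
```

The good split removes nearly all of the variance; the poor split removes little.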

Ex) Attribute selection
w/ Information Gain

Calculate info. gain and
illustrate entropy (w/ shading)

GOAL: as little shaded area as possible (less shading = lower entropy)

Supervised Segmentation
w/ Tree-Structured Models

Tree is upside down
w/ root at the top

Tree is made up of
nodes (interior and terminal)

Each interior node
contains a test of an attribute

Each path ends at
a terminal node, or leaf

Tree is supervised b/c
each leaf contains a
classification for its segment

Classification tree, or more
loosely, decision tree

Most data mining packages
include some type of tree
induction technique

"Divide and conquer"
approach

GOAL: select an attribute
to split into subgroups
at each step
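
A minimal sketch of divide-and-conquer tree induction for categorical attributes. Assumptions not in the notes: information gain picks the attribute, recursion stops when a segment is pure or no attribute helps, and leaves predict the majority class. The tiny dataset and all names (build_tree, classify, employed, balance) are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def info_gain(rows, attr, target):
    """Information gain from splitting rows on one categorical attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in rows]) - weighted

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: segment is pure, or there is nothing left to split on
    if len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    if info_gain(rows, best, target) <= 0:
        return majority
    remaining = [a for a in attrs if a != best]
    branches = {}
    # Divide: one subgroup per value; conquer: recurse on each subgroup
    for value in sorted(set(r[best] for r in rows)):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {"attribute": best, "branches": branches}

def classify(tree, row):
    """Follow the path from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree

rows = [
    {"employed": "yes", "balance": "high", "write_off": "no"},
    {"employed": "yes", "balance": "low", "write_off": "no"},
    {"employed": "yes", "balance": "low", "write_off": "no"},
    {"employed": "no", "balance": "high", "write_off": "no"},
    {"employed": "no", "balance": "low", "write_off": "yes"},
    {"employed": "no", "balance": "low", "write_off": "yes"},
]
tree = build_tree(rows, ["employed", "balance"], "write_off")
print(tree)
print(classify(tree, {"employed": "no", "balance": "low"}))
```

The sketch skips numeric attributes, pruning, and attribute values not seen in training (classify would raise a KeyError for those).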

Visualizing Segmentations

Only possible to visualize
segmentations in 2 or 3
dimensions at once

Useful in understanding
the different types of models

Sets of Rules

Tree gathers common rule
prefixes together toward
the top of the tree

As the model grows, some people
prefer the tree and others the rule set
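
A sketch of how a tree can be read as a set of rules: trace each path from the root to a leaf and join the tests along the way w/ AND. The nested-dict tree format, the example tree, and the function name are assumptions made for illustration.

```python
def tree_to_rules(tree, conditions=()):
    """Return one (conditions, prediction) pair per root-to-leaf path."""
    if not isinstance(tree, dict):          # reached a leaf
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        test = (tree["attribute"], value)
        rules.extend(tree_to_rules(subtree, conditions + (test,)))
    return rules

# Hypothetical classification tree
tree = {
    "attribute": "employed",
    "branches": {
        "yes": "no write-off",
        "no": {
            "attribute": "balance",
            "branches": {"high": "no write-off", "low": "write-off"},
        },
    },
}

for conditions, prediction in tree_to_rules(tree):
    tests = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"IF {tests} THEN class = {prediction}")
```

Rules from the same subtree share their leading tests (here, employed = no), which is the common prefix the tree gathers near the top.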

Probability Estimation

Estimating class probabilities
(not just class labels) supports
more sophisticated decision making

Estimates from leaves w/ few instances
can be overly optimistic, one example of
the fundamental issue of "overfitting"

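Laplace correction: a smoothed version of
the frequency-based estimate that moderates
the influence of leaves w/ few instances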
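
A sketch of the Laplace correction for a two-class leaf, assuming the usual binary form p(c) = (n + 1) / (n + m + 2), where n is the number of instances in the leaf belonging to class c and m the number that do not. Function names and counts are illustrative.

```python
def frequency_estimate(n, m):
    """Raw frequency-based estimate of p(c) at a leaf."""
    return n / (n + m)

def laplace_estimate(n, m):
    """Laplace-corrected estimate of p(c) for a two-class problem."""
    return (n + 1) / (n + m + 2)

# Hypothetical leaves: (instances of class c, instances of the other class)
for n, m in [(2, 0), (20, 0), (200, 0)]:
    print(n, m, frequency_estimate(n, m), round(laplace_estimate(n, m), 3))
```

The raw estimate is 1.0 for all three leaves; the corrected estimate is 0.75 for the 2-instance leaf and moves toward 1.0 as the leaf gets more instances.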