Chapter 3: Predictive Modeling

Supervised Segmentation

Have a target quantity

Finding or selecting important informative variables

How can we segment the population with respect to something we would like to predict or estimate? (groups that differ from each other)

"Which accounts have been defrauded?"

"Which customers are likely to not pay off their account balances?"

Classification Tree Induction

Predictive Model

A formula for estimating the unknown value of interest: the target

Prediction

To estimate an unknown value

Descriptive Model

To gain insight into the underlying phenomenon or process

Judged on its intelligibility, not solely on its predictive performance (the criterion for a predictive model)

Value comes from the understanding gained from looking at the model rather than from its predictions

Vocabulary

Supervised Learning

Model describes a relationship between a set of selected variables & a predefined variable (target)

Estimates value of target variable as a function of the features

Instance / Example

Represents a fact or a data point

Described by a set of attributes (fields, columns, variables or features)

Also called: Row or Feature Vector or case

Dataset

Also called: Table or worksheet

Contains a set of examples or instances

Features

Table columns

Also called: Independent Variables or Predictors or Explanatory Variables

Target Variable

Values that are to be predicted

Also called: Dependent Variable

Model Induction

Creation of models from data

Induction Algorithm or Learner

The procedure that creates the model from the data

Has variants that induce models for both classification and regression

Deduction

Contrasted with Induction

Starts with general rules & specific facts & creates other specific facts from them

Training Data

Input data for induction algorithm

Also called: Labeled data

Value for the target variable (label) is known
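
A minimal sketch (in Python, assuming pandas) of what a labeled training set looks like, borrowing the head-shape / body-shape / write-off attributes used in the example later in these notes; the numeric balance column is an illustrative addition:

```python
import pandas as pd

# Illustrative labeled training set: each row is an instance (feature vector),
# each column is a feature, and "write_off" is the target variable (the label)
training_data = pd.DataFrame({
    "head_shape": ["square", "circular", "square", "circular"],
    "body_shape": ["rectangular", "oval", "oval", "rectangular"],
    "balance":    [120_000, 51_000, 64_000, 340_000],   # numeric attribute (illustrative)
    "write_off":  ["yes", "no", "no", "yes"],            # target / dependent variable
})

features = training_data.drop(columns="write_off")   # independent variables / predictors
target = training_data["write_off"]                   # values to be predicted
```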

Try to segment the population into subgroups with different values for the target variable

How can we judge whether a variable contains important information about the target variable?

Selecting Informative Attributes

"head-shape: square, circular" "body-shape: rectangular, oval"

"write-off: yes, no"

Distinguish write-offs from non-write-offs

Pure

Homogeneous with respect to the target variable

If every member of a group has the same value for the target, then the group is pure

Make segments as pure as possible

Attributes rarely split a group perfectly

Not all attributes are binary

Many have 3 or more distinct values

Some attributes are numeric values

Purity Measure

Entropy

Measure of disorder that can be applied to a set

Information Gain

Based on Entropy

Splitting criterion

Properties

Disorder corresponds to how mixed (impure) the segment is with respect to the properties of interest

Write-Offs

entropy = - p1 log(p1) - p2 log(p2) - ...

pi = probability (relative proportion) of property i within the set

pi ranges from 1 (when all members of the set have property i) to 0 (when no members of the set have property i)
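
A small sketch of the entropy calculation for one segment, assuming a two-valued target such as write-off = yes/no (the function and variable names are my own):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a set of target values: 0 = pure, 1 = maximally mixed for two classes."""
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A segment with 7 non-write-offs and 3 write-offs is fairly impure:
print(entropy(["no"] * 7 + ["yes"] * 3))   # ~0.881
print(entropy(["no"] * 10))                # 0.0 (pure segment)
```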

An attribute segments a set of instances into several subsets

Entropy measures how impure one individual subset is

Information gain measures how much an attribute improves purity (decreases entropy) across the subsets it creates

Measures the change in entropy due to any amount of new information being added

IG(parent, children) = entropy(parent) - [p(c1) x entropy(c1) + p(c2) x entropy(c2) + ...], where p(ci) is the proportion of instances falling into child subset ci

Used to produce a multiple attribute supervised segmentation
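
A minimal, self-contained sketch of the information gain formula above (the 7/3 split used in the example is made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    probs = [c / len(labels) for c in Counter(labels).values()]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent_labels, children_labels):
    """IG = entropy(parent) minus the weighted average entropy of the child subsets."""
    total = len(parent_labels)
    weighted = sum((len(child) / total) * entropy(child) for child in children_labels)
    return entropy(parent_labels) - weighted

# Splitting 10 instances (7 "no" / 3 "yes") on a binary attribute:
parent = ["no"] * 7 + ["yes"] * 3
children = [["no"] * 6 + ["yes"], ["no"] + ["yes"] * 2]
print(round(information_gain(parent, children), 3))   # ~0.191: the split reduces disorder
```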

Interior Node

Test of an attribute

Branch from the Node

Distinct value or range of values

Terminal Node

Leaf

Leaves correspond to segments & contain instances that tend to belong to the same class

Contains a value for the target variable

Each leaf contains a classification for its segment

Decision Nodes

Non-leaf nodes

Each tests an attribute; the class prediction comes from the leaf an instance ends up in

Root Node

Start here & descend through interior nodes

End at a leaf: write-off or non-write-off

Divide & conquer approach
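
A short sketch of classification tree induction on a tiny illustrative dataset, using scikit-learn's DecisionTreeClassifier with its entropy splitting criterion; the printed text shows the interior nodes, branches, and leaves:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny illustrative training set (same shape as the earlier sketch)
data = pd.DataFrame({
    "head_shape": ["square", "circular", "square", "circular"],
    "body_shape": ["rectangular", "oval", "oval", "rectangular"],
    "write_off":  ["yes", "no", "no", "yes"],
})
X = pd.get_dummies(data.drop(columns="write_off"))   # encode categorical features
y = data["write_off"]                                 # target variable

# Model induction: the learner repeatedly picks the attribute with the
# highest information gain (criterion="entropy") to split on
model = DecisionTreeClassifier(criterion="entropy").fit(X, y)

print(export_text(model, feature_names=list(X.columns)))  # interior nodes, branches, leaves
print(model.predict(X.iloc[:2]))                          # class predictions for two instances
```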

Instance Space

Space described by the data features

Instance Space Visualization

Scatterplot to detect correlations & relationships

Can only visualize segmentations in 2 or 3 dimensions at once
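
A minimal sketch of an instance-space scatterplot in two dimensions, with axis-parallel lines marking the kind of splits a classification tree would make; the age and balance features and the target rule are illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 70, 200)                 # feature 1 (x axis)
balance = rng.uniform(0, 200_000, 200)         # feature 2 (y axis)
write_off = (balance > 100_000) & (age < 45)   # illustrative target rule

plt.scatter(age[~write_off], balance[~write_off], marker="o", label="non-write-off")
plt.scatter(age[write_off], balance[write_off], marker="x", label="write-off")
plt.axhline(100_000, linestyle="--")           # axis-parallel splits a tree might make
plt.axvline(45, linestyle="--")
plt.xlabel("age")
plt.ylabel("balance")
plt.legend()
plt.show()
```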