Chapter 3: Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Models, Induction, and Prediction
Predictive model: a formula for estimating the unknown value of interest: the target
Formula: could be a logical statement, a mathematical expression, or a hybrid of the two
Two types: classification models and regression models
Contrast: Descriptive Modeling: primary purpose of the model is not to estimate a value but instead to gain insight into the underlying phenomenon or process
Predictive Model: may be judged solely on its predictive performance
Supervised Learning: Model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable
Model estimates the value of the target variable as a function of the features
Vocab:
Instance or example: represents a fact or a data point
Instance is described by a set of attributes also called a feature vector
Model Induction: the creation of models from data
Training Data: the input data for the induction algorithm, used for inducing the model
They are called LABELED data because the value of the target variable (the label) is known
Supervised Segmentation
If the segmentation is done using values of variables that will be known when the target is not, then these segments can be used to predict the value of the target variable
Supervised Segmentation with Tree-Structured Models
Tree is made up of nodes (interior nodes and terminal nodes) and branches emanating from the interior nodes
Following the branches from the root node down, each path eventually terminates at a terminal node, or leaf
Tree = supervised segmentation, because each leaf contains a value for the target variable, such a tree is called a classification tree or a decision tree
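A minimal sketch of such a tree in Python, assuming a hypothetical write-off example with made-up attributes (`balance`, `age`); interior nodes test an attribute, leaves hold a value for the target variable:

```python
# Sketch of a classification tree as nested nodes (illustrative names,
# not the book's implementation).
class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # interior node: attribute to test
        self.branches = branches or {}  # attribute value -> child Node
        self.label = label              # leaf: value of the target variable

def classify(node, instance):
    """Follow branches from the root down until reaching a leaf."""
    while node.label is None:
        node = node.branches[instance[node.attribute]]
    return node.label

# A tiny tree: the root tests "balance"; one child tests "age".
tree = Node("balance", {
    "high": Node(label="write-off"),
    "low":  Node("age", {
        "young": Node(label="write-off"),
        "old":   Node(label="no write-off"),
    }),
})
print(classify(tree, {"balance": "low", "age": "old"}))  # no write-off
```

Each leaf corresponds to one segment of the supervised segmentation; the path of branch values leading to it defines the segment's membership conditions.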
Selecting Informative Attributes:
Once the attributes are chosen, segment the population into groups in a way that distinguishes write-offs from non-write-offs
Want the resulting groups to be as PURE as possible
Pure= homogenous with respect to the target variable
Complications
1.) Attributes rarely split a group perfectly
2.) Sometimes a split peels off only a single data point into a pure subset
3.) Not all attributes are binary
4.) Some attributes take on numeric values
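Complication 4 is commonly handled by turning a numeric attribute into a binary split: try candidate thresholds (midpoints between sorted values) and keep the one giving the purest segments. A sketch with made-up data, using Gini impurity as the purity measure for brevity (an alternative to the entropy-based measure the notes describe):

```python
# Sketch: splitting a numeric attribute by threshold (illustrative data).
def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 means a perfectly pure segment."""
    n = len(labels)
    return 1 - sum((labels.count(v) / n) ** 2 for v in set(labels))

def best_threshold(values, labels):
    """Try midpoints between consecutive sorted values; return the
    threshold minimizing the size-weighted impurity of the two segments."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

ages = [22, 25, 30, 45, 50, 60]
labels = ["write-off", "write-off", "write-off", "no", "no", "no"]
print(best_threshold(ages, labels))  # 37.5 -- separates the classes perfectly
```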
Information Gain: Most common splitting criterion; it is based on a purity measure called entropy
Entropy: measure of disorder that can be applied to a set, such as one of our individual segments
In supervised segmentation, the member properties will correspond to the values of the target variable
Entropy-based information gain measures how much a split decreases entropy across the whole segmentation it creates, weighting each segment by its size
Natural measure of impurity for numeric values is variance
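These definitions can be sketched directly, assuming a binary target and made-up segments (not the book's code):

```python
import math
import statistics

def entropy(labels):
    """Entropy of a set: -sum(p_i * log2(p_i)) over the target values."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child segments."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["write-off"] * 5 + ["no"] * 5   # maximally impure: entropy = 1.0
left   = ["write-off"] * 4 + ["no"] * 1   # mostly write-offs
right  = ["write-off"] * 1 + ["no"] * 4   # mostly non-write-offs
gain = information_gain(parent, [left, right])  # about 0.278 bits

# For a numeric target, variance plays the role of impurity:
numeric_target = [10.0, 12.0, 11.0, 30.0]
impurity = statistics.pvariance(numeric_target)
```

The split above is informative but not perfect: it reduces uncertainty about the target without eliminating it, which is the typical case (complication 1 above).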
Visualizing Segmentations
Common form of instance space visualization is a scatterplot on some pair of features, used to compare one variable against another to detect correlations and relationships
Instance space in a few dimensions is useful for understanding the different types of models because it provides insights that apply to higher dimensional spaces
Trees as Sets of Rules
Classify a new, unseen instance by following the branches from the root node down to a leaf, which gives the predicted class. If we trace down a single path from the root node to a leaf, collecting conditions as we go, we generate a rule
Every classification tree can be expressed as a set of rules this way
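A sketch of that conversion, representing the tree as a nested tuple/dict structure with hypothetical attributes (leaves are class strings):

```python
# Sketch: enumerate every root-to-leaf path as an IF ... THEN ... rule.
def tree_to_rules(tree, conditions=()):
    if isinstance(tree, str):  # leaf: its value is the predicted class
        return ["IF " + " AND ".join(conditions) + " THEN " + tree]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + (f"{attribute}={value}",))
    return rules

tree = ("balance", {
    "high": "write-off",
    "low": ("age", {"young": "write-off", "old": "no write-off"}),
})
for rule in tree_to_rules(tree):
    print(rule)
```

One rule comes out per leaf, so the rule set and the tree describe exactly the same segmentation.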
Probability Estimation
Tree induction can easily produce probability estimation trees instead of simple classification trees
Example: if a leaf contains n positive instances and m negative instances, the probability of any new instance being positive may be estimated as n/(n+m) -> this is called the frequency-based estimate of class membership probability
Overfitting
Instead of simply computing the frequency, we would often use a "smoothed" version of the frequency-based estimate, known as the LAPLACE CORRECTION
EQUATION: p(c) = (n + 1) / (n + m + 2)
where n = number of examples in the leaf belonging to class c, and m = number of examples not belonging to class c
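Both estimates side by side, showing why the smoothing matters for small leaves:

```python
# Frequency-based vs Laplace-corrected probability estimates for a leaf
# with n instances of class c and m instances of other classes.
def frequency_estimate(n, m):
    return n / (n + m)

def laplace_estimate(n, m):
    """Smoothed estimate: (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives: the raw frequency claims
# certainty (1.0), while the Laplace correction tempers the estimate
# toward 0.5 because so little evidence supports it.
print(frequency_estimate(2, 0))  # 1.0
print(laplace_estimate(2, 0))    # 0.75
```

As n + m grows, the two estimates converge, so the correction mainly guards against overconfident probabilities from tiny segments.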
Extra
Example: Supervised segmentation: how can we segment the population into groups that differ from each other with respect to some quantity of interest?
Informative: information is a quantity that reduces uncertainty about something
Tree Induction: finding informative attributes is also the basis for a widely used predictive modeling technique
It incorporates the idea of supervised segmentation in an elegant manner