DATA SCIENCE FOR BUSINESS 3 (PREDICTIVE MODELING - FROM CORRELATION TO SUPERVISED SEGMENTATION)
SUPERVISED SEGMENTATION
Segment the population into groups that differ from each other with respect to something we would like to predict or estimate
FIND / SELECT IMPORTANT - INFORMATIVE VARIABLES OR ATTRIBUTES OF THE ENTITY
information is a quantity that reduces uncertainty about something
TARGET = DEFINED (supervised)
We would like to PREDICT / UNDERSTAND BETTER
REDUCES UNCERTAINTY
FIND ATTRIBUTES THAT CORRELATE WITH THE TARGET
SELECT A SUBSET OF DATA IN LARGE DATABASE
MODEL: simplified representation of reality created to serve a purpose
ASSUMPTIONS
CONSTRAINTS
EX: a MAP is a model of the physical world. It abstracts away a lot of information that is irrelevant for its purpose
PREDICTIVE MODEL: formula for estimating the unknown value of the target
PREDICTION
ESTIMATING AN UNKNOWN VALUE
SOMETHING IN THE FUTURE
SOMETHING IN THE PRESENT
(USING PAST DATA)
Judged for its PREDICTIVE PERFORMANCE (but INTELLIGIBILITY is important)
DESCRIPTIVE MODEL: insight into the phenomenon or process (assess the present data)
Judged for its ACCURACY but also INTELLIGIBILITY
MODEL INDUCTION: creation of models from data
DATASET = TABLE of a DATABASE = WORKSHEET of a SPREADSHEET
EXAMPLES / INSTANCES = ROWS
FEATURES = COLUMNS
INDEPENDENT VARIABLES = PREDICTORS = EXPLANATORY VARIABLES
DEPENDENT VARIABLES = TARGET VARIABLE
INPUT DATA FOR THE INDUCTION ALGORITHM = TRAINING DATA
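The vocabulary above can be made concrete with a small sketch. Everything in it is hypothetical: the rows, the column names, and the helper `split_features_target` are invented for illustration and do not come from the source. It only shows how instances (rows), features (columns) and the target variable relate in a training dataset.

```python
# Hypothetical toy dataset: every value below is made up for illustration.
# Each dict is one EXAMPLE / INSTANCE (a row); each key is a column.
dataset = [
    {"age": 45, "city": "New York City", "occupation": "professional", "churned": False},
    {"age": 23, "city": "Boston", "occupation": "student", "churned": True},
    {"age": 51, "city": "New York City", "occupation": "professional", "churned": False},
]

TARGET = "churned"  # assumed name of the dependent (target) variable


def split_features_target(rows, target):
    """Separate each instance into its features (predictors) and its target value."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    targets = [row[target] for row in rows]
    return features, targets


# `features` holds the independent variables, `targets` the dependent variable.
features, targets = split_features_target(dataset, TARGET)
feature_names = sorted(features[0])

print("features:", feature_names)
print("targets:", targets)
```

An induction algorithm would receive `features` together with `targets` as its training data.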
SELECT ONE / MORE ATTRIBUTES / FEATURES / VARIABLES that will best DIVIDE THE SAMPLE WITH RESPECT TO OUR TARGET VARIABLE OF INTEREST
SUPERVISED SEGMENTATION
SEGMENT THE POPULATION INTO SUBGROUPS WITH DIFFERENT VALUES FOR THE TARGET VARIABLE
EX: “Middle-aged professionals who reside in New York City on average have a churn rate of 5%.”
DEFINITION OF THE SEGMENT WITH SOME ATTRIBUTES = middle-aged professionals who reside in New York City
PREDICTED VALUE OF THE TARGET VARIABLE = churn rate of 5%
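A minimal sketch of how a segment definition and its predicted target value fit together. The customer records and the helper `churn_rate` are hypothetical, and the resulting numbers are not the 5% from the example above. The segment is defined by attribute values, and the predicted value is simply the average of the target inside that segment.

```python
# Hypothetical customer records, invented for illustration only.
customers = [
    {"age_group": "middle-aged", "occupation": "professional", "city": "New York City", "churned": False},
    {"age_group": "middle-aged", "occupation": "professional", "city": "New York City", "churned": False},
    {"age_group": "middle-aged", "occupation": "professional", "city": "New York City", "churned": True},
    {"age_group": "middle-aged", "occupation": "professional", "city": "New York City", "churned": False},
    {"age_group": "young", "occupation": "student", "city": "New York City", "churned": True},
    {"age_group": "senior", "occupation": "retired", "city": "Boston", "churned": True},
]


def churn_rate(rows, **segment):
    """Share of churners among the rows matching every attribute value in `segment`.

    Returns None when no row belongs to the segment.
    """
    members = [r for r in rows if all(r[k] == v for k, v in segment.items())]
    if not members:
        return None
    return sum(r["churned"] for r in members) / len(members)


# Segment definition = attribute values; predicted target value = rate in the segment.
segment_rate = churn_rate(
    customers,
    age_group="middle-aged",
    occupation="professional",
    city="New York City",
)
overall_rate = churn_rate(customers)  # no attributes: the whole population

print(f"segment churn rate: {segment_rate:.0%}")
print(f"overall churn rate: {overall_rate:.0%}")
```

A segment is informative when its rate differs clearly from the overall rate, as it does in this toy data.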
HOW CAN WE JUDGE WHETHER A VARIABLE CONTAINS IMPORTANT INFORMATION ABOUT THE TARGET VARIABLE?
rank the variables by how good they are at predicting the value of the target
automatically get a selection of the more informative variables with respect to a particular task
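One possible way to rank variables is sketched below. The notes above do not name a scoring measure, so information gain (the reduction in entropy of the target after splitting on an attribute) is an assumption here, chosen because it matches the idea of information as something that reduces uncertainty. The data and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of target labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children


def rank_attributes(rows, attributes, target):
    """Return (attribute, gain) pairs, most informative first."""
    scores = {a: information_gain(rows, a, target) for a in attributes}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical binary classification data, invented for illustration.
rows = [
    {"contract": "monthly", "gender": "f", "churned": "yes"},
    {"contract": "monthly", "gender": "m", "churned": "yes"},
    {"contract": "yearly", "gender": "f", "churned": "no"},
    {"contract": "yearly", "gender": "m", "churned": "no"},
]

ranking = rank_attributes(rows, ["gender", "contract"], "churned")
for name, gain in ranking:
    print(f"{name}: {gain:.2f} bits")
```

In this toy data `contract` separates churners from non-churners perfectly while `gender` tells us nothing, so `contract` ranks first. Keeping only the top-ranked attributes is one simple form of variable selection.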
SELECTING INFORMATIVE ATTRIBUTES
EXAMPLE
BINARY CLASSIFICATION PROBLEM
VALUE OF TARGET VARIABLE
PREDICTOR ATTRIBUTES
just one application of selecting informative variables