Chapter 3: Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Models, Induction, and Prediction
Predictive model: a formula for estimating the unknown value of interest: the target
Formula: could be a logical statement, a mathematical expression, or a hybrid of the two
Two types: classification models and regression models
Contrast: Descriptive Modeling: primary purpose of the model is not to estimate a value but instead to gain insight into the underlying phenomenon or process
Predictive Model: may be judged solely on its predictive performance
Supervised Learning: Model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable
Model estimates the value of the target variable as a function of the features
Vocab:
Instance or example: represents a fact or a data point
Instance is described by a set of attributes also called a feature vector
Model Induction: the creation of models from data
Training Data: the input data for the induction algorithm, used for inducing the model
They are called LABELED data because the value of the target variable (the label) is known
Supervised Segmentation
If the segmentation is done using values of variables that will be known when the target is not, then these segments can be used to predict the value of the target variable
Supervised Segmentation with Tree-Structured Models
Tree is made up of nodes (interior nodes and terminal nodes) and branches emanating from the interior nodes
Following the branches from the root node down, each path eventually terminates at a terminal node, or leaf
Tree = supervised segmentation, because each leaf contains a value for the target variable, such a tree is called a classification tree or a decision tree
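A minimal sketch of such a tree in Python, assuming a hypothetical write-off example with made-up attributes (`balance`, `age`); interior nodes test an attribute, leaves hold a value for the target variable:

```python
# Sketch of a classification tree as nested nodes (illustrative names,
# not the book's implementation).
class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # interior node: attribute to test
        self.branches = branches or {}  # attribute value -> child Node
        self.label = label              # leaf: value of the target variable

def classify(node, instance):
    """Follow branches from the root down until reaching a leaf."""
    while node.label is None:
        node = node.branches[instance[node.attribute]]
    return node.label

# A tiny tree: the root tests "balance"; one child tests "age".
tree = Node("balance", {
    "high": Node(label="write-off"),
    "low":  Node("age", {
        "young": Node(label="write-off"),
        "old":   Node(label="no write-off"),
    }),
})
print(classify(tree, {"balance": "low", "age": "old"}))  # no write-off
```

Each leaf corresponds to one segment of the supervised segmentation; the path of branch values leading to it defines the segment's membership conditions.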
Selecting Informative Attributes:
Once the attributes are chosen, segment the population into groups in a way that distinguishes write-offs from non-write-offs
Want the resulting groups to be as PURE as possible
Pure= homogenous with respect to the target variable
Complications
1.) Attributes rarely split a group perfectly
2.) Sometimes a split peels off only a single data point into a pure subset
3.) Not all attributes are binary
4.) Some attributes take on numeric values
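Complication 4 is commonly handled by turning a numeric attribute into a binary split: try candidate thresholds (midpoints between sorted values) and keep the one giving the purest segments. A sketch with made-up data, using Gini impurity as the purity measure for brevity (an alternative to the entropy-based measure the notes describe):

```python
# Sketch: splitting a numeric attribute by threshold (illustrative data).
def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 means a perfectly pure segment."""
    n = len(labels)
    return 1 - sum((labels.count(v) / n) ** 2 for v in set(labels))

def best_threshold(values, labels):
    """Try midpoints between consecutive sorted values; return the
    threshold minimizing the size-weighted impurity of the two segments."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

ages = [22, 25, 30, 45, 50, 60]
labels = ["write-off", "write-off", "write-off", "no", "no", "no"]
print(best_threshold(ages, labels))  # 37.5 -- separates the classes perfectly
```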
Information Gain: Most common splitting criterion; it is based on a purity measure called entropy
Entropy: measure of disorder that can be applied to a set, such as one of our individual segments
In supervised segmentation, the member properties will correspond to the values of the target variable
Entropy-based information gain measures how much a split decreases entropy across the whole segmentation it creates, weighting each segment by its size
Natural measure of impurity for numeric values is variance
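These definitions can be sketched directly, assuming a binary target and made-up segments (not the book's code):

```python
import math
import statistics

def entropy(labels):
    """Entropy of a set: -sum(p_i * log2(p_i)) over the target values."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child segments."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["write-off"] * 5 + ["no"] * 5   # maximally impure: entropy = 1.0
left   = ["write-off"] * 4 + ["no"] * 1   # mostly write-offs
right  = ["write-off"] * 1 + ["no"] * 4   # mostly non-write-offs
gain = information_gain(parent, [left, right])  # about 0.278 bits

# For a numeric target, variance plays the role of impurity:
numeric_target = [10.0, 12.0, 11.0, 30.0]
impurity = statistics.pvariance(numeric_target)
```

The split above is informative but not perfect: it reduces uncertainty about the target without eliminating it, which is the typical case (complication 1 above).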
Visualizing Segmentations
Common form of instance space visualization is a scatterplot on some pair of features, used to compare one variable against another to detect correlations and relationships
Instance space in a few dimensions is useful for understanding the different types of models because it provides insights that apply to higher dimensional spaces
Trees as Sets of Rules
Classify a new, unseen instance by following the branches from the root node down to a leaf, which gives the predicted class. If we trace down a single path from the root node to a leaf, collecting conditions as we go, we generate a rule
Every classification tree can be expressed as a set of rules this way
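A sketch of that conversion, representing the tree as a nested tuple/dict structure with hypothetical attributes (leaves are class strings):

```python
# Sketch: enumerate every root-to-leaf path as an IF ... THEN ... rule.
def tree_to_rules(tree, conditions=()):
    if isinstance(tree, str):  # leaf: its value is the predicted class
        return ["IF " + " AND ".join(conditions) + " THEN " + tree]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + (f"{attribute}={value}",))
    return rules

tree = ("balance", {
    "high": "write-off",
    "low": ("age", {"young": "write-off", "old": "no write-off"}),
})
for rule in tree_to_rules(tree):
    print(rule)
```

One rule comes out per leaf, so the rule set and the tree describe exactly the same segmentation.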
Probability Estimation
Tree induction can easily produce probability estimation trees instead of simple classification trees
Example: if a leaf contains n positive instances and m negative instances, the probability of any new instance being positive may be estimated as n/(n+m) -> this is called the frequency-based estimate of class membership probability
Overfitting
Instead of simply computing the frequency, we would often use a "smoothed" version of the frequency-based estimate, known as the LAPLACE CORRECTION
EQUATION: p(c) = (n + 1) / (n + m + 2)
where n = number of examples in the leaf belonging to class c, and m = number of examples not belonging to class c
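Both estimates side by side, showing why the smoothing matters for small leaves:

```python
# Frequency-based vs Laplace-corrected probability estimates for a leaf
# with n instances of class c and m instances of other classes.
def frequency_estimate(n, m):
    return n / (n + m)

def laplace_estimate(n, m):
    """Smoothed estimate: (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives: the raw frequency claims
# certainty (1.0), while the Laplace correction tempers the estimate
# toward 0.5 because so little evidence supports it.
print(frequency_estimate(2, 0))  # 1.0
print(laplace_estimate(2, 0))    # 0.75
```

As n + m grows, the two estimates converge, so the correction mainly guards against overconfident probabilities from tiny segments.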
Extra
Example: Supervised segmentation: how can we segment the population into groups that differ from each other with respect to some quantity of interest?
Informative: information is a quantity that reduces uncertainty about something
Tree Induction: finding informative attributes is also the basis for a widely used predictive modeling technique
It incorporates the idea of supervised segmentation in an elegant manner