Chapter 15: Building Candidate Models
Starting the Process
Select target feature
"Use as Target"
Distribution is displayed
3 sets of choices to make
DR offers a choice of which metric to optimize the produced models for, and its recommendation can generally be trusted
LogLoss (Accuracy) = model is evaluated based on probabilities generated and their distance from the correct answer
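As a rough illustration of the LogLoss idea above, the sketch below computes binary log loss by hand. It is a minimal example under stated assumptions, not DR's implementation: the function name, the clipping constant eps, and the toy labels and probabilities are all made up for illustration.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1).

    y_prob holds the predicted probability of class 1 for each case.
    eps is an assumed clipping constant that avoids log(0).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Probabilities close to the correct answer give a small loss,
# probabilities far from it give a large loss.
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
```

Lower is better: the penalty grows quickly as a predicted probability moves away from the correct answer.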
Advanced options
Run models using: Cross-Validation vs. Train-Validation Holdout
CV = a reminder of how the cases are distributed across samples (5 folds plus a holdout); also specifies that reported scores are the average of the scores across the models used
Select partitioning options (Stratified, random, group, date/time, partition feature)
Date/time = making sure that all validation cases occur in a time period after the time of the cases used to create models
Select the holdout (lock-box); the remaining cases are split into n folds
Group = allows specification of a group membership feature
Partition Feature = determine exactly which cases are used in different folds
Stratified works a bit harder to maintain the same distribution of target values inside the holdout as the other samples
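To make the Date/time option above concrete, here is a minimal sketch of a time-ordered split in which every validation case comes after every training case. The helper name time_split, the "when" field, and the 20% validation share are assumptions for illustration only; this is not how DR implements its date/time partitioning.

```python
from datetime import date


def time_split(rows, time_key, validation_fraction=0.2):
    """Put the most recent cases in validation and the rest in training.

    Assumes timestamps are distinct; with ties at the cut, cases with the
    same timestamp could land on both sides.
    """
    ordered = sorted(rows, key=lambda r: r[time_key])
    n_valid = round(len(ordered) * validation_fraction)
    cut = len(ordered) - n_valid
    return ordered[:cut], ordered[cut:]


# Ten hypothetical monthly cases.
rows = [{"when": date(2020, month, 1), "y": month % 2} for month in range(1, 11)]
train, valid = time_split(rows, "when")
```

A random split of the same rows could place later cases in training and earlier ones in validation, which is exactly what the Date/time option is meant to prevent.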
Starting the Analytical Process
Data prepared through: Autopilot, Quick, and Manual
Make sure: stratified partitioning method w/ 5-fold cross validation & 20% holdout sample
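The recommended setup above (stratified partitioning, 5 folds, 20% holdout) can be sketched in plain Python as follows. This is an illustrative sketch only: the function name stratified_partition, the round-robin fold assignment, and the toy target list are assumptions, not DR's actual procedure.

```python
import random
from collections import defaultdict


def stratified_partition(targets, n_folds=5, holdout_fraction=0.2, seed=0):
    """Label each case 'holdout' or with a fold number 0..n_folds-1.

    The split is done separately within each target value so that the
    holdout and the folds keep roughly the same target distribution.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, target in enumerate(targets):
        by_class[target].append(idx)

    assignment = [None] * len(targets)
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = round(len(indices) * holdout_fraction)
        for idx in indices[:n_holdout]:
            assignment[idx] = "holdout"
        # Deal the remaining cases of this class out to the folds in turn.
        for pos, idx in enumerate(indices[n_holdout:]):
            assignment[idx] = pos % n_folds
    return assignment


# Toy target: 20 positive and 80 negative cases.
targets = [1] * 20 + [0] * 80
parts = stratified_partition(targets)
holdout_positives = sum(1 for t, p in zip(targets, parts) if p == "holdout" and t == 1)
```

Because each target value is split on its own, the holdout keeps the same 20% share of positives as the full toy dataset.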
Informative Features = all the data features except those excluded and tagged, e.g. [Duplicate] or [Too Few Values]
7-step process
1: Setting target feature
2: Creating CV and holdout partitions
3: Characterizing target variable
4: Loading dataset and preparing data (if dataset is large)
5: Saving target and partitioning information (where the actual partitions, i.e. cross-validation folds and holdout set, are stored)
6: Analyzing features
7: Calculating list of models (where info is used to determine which blueprints to run in the Autopilot process)
Importance = indicates the relative importance of a particular feature when examined against the target independently of all other features
Model Selection Process
Each type of algorithm has very different run-times and processing needs, and each is assigned to a "worker" (computer)
Think of DR Autopilot as a 5-round tournament to determine the best approach and algorithm (blueprints)
Round 1: 16% Sample (make decisions about algorithm selection for further participation)
Ex: RandomForest Classifier (Gini)
Round 2: 32% Sample (smaller set of algorithms)
Round 3: 64% Sample (an even smaller set of algorithms is run again and cross-validated)
Round 4: Cross Validation (Run full CV process on top models)
Round 5: Blending (after CV, models are ranked by their scores and then blended)
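The tournament idea in the rounds above can be sketched as a simple elimination loop followed by an average blend. Everything specific here is an assumption for illustration: the notes do not say how many blueprints survive each round, so keep_fraction, the toy error scores, and the plain-average blender are hypothetical and do not describe DR's actual selection rules. The full cross-validation round is omitted for brevity.

```python
def run_tournament(candidates, score_fn, sample_sizes=(0.16, 0.32, 0.64),
                   keep_fraction=0.5):
    """Re-score the surviving candidates on a larger sample each round.

    score_fn(name, sample_size) returns an error score (lower is better).
    keep_fraction is an assumed elimination rule, not taken from the notes.
    """
    survivors = list(candidates)
    for size in sample_sizes:
        ranked = sorted(survivors, key=lambda name: score_fn(name, size))
        n_keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:n_keep]
    return survivors


def blend(probabilities_per_model):
    """Average blender: mean predicted probability per case across models."""
    n_models = len(probabilities_per_model)
    return [sum(case) / n_models for case in zip(*probabilities_per_model)]


# Hypothetical blueprints with made-up base errors.
base_error = {"A": 0.40, "B": 0.35, "C": 0.50, "D": 0.45,
              "E": 0.60, "F": 0.55, "G": 0.65, "H": 0.70}


def toy_score(name, size):
    # Toy rule: error shrinks slightly as the training sample grows.
    return base_error[name] - 0.1 * size


finalists = run_tournament(base_error, toy_score)
blended = blend([[0.2, 0.8], [0.4, 0.6]])
```

Running cheap rounds on small samples first means the expensive later rounds are only spent on the algorithms that still look promising.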
BP3 = notes the specific process flow that was implemented for the algorithm