Chapter 15: Build Candidate Models
Starting the Process
Select target feature
Distribution of target feature is displayed
DataRobot automatically down samples the majority case
Option of which metric to optimize
DataRobot's default choice can usually be trusted, but you might want to check it
LogLoss (Accuracy)
Model is evaluated on the probabilities it generates and their distance from the correct answer
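To make the LogLoss idea concrete, here is a minimal sketch of the standard binary log loss formula. It is not DataRobot's internal code; the function name `log_loss` and the clipping constant `eps` are assumptions chosen for illustration.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1).

    y_prob holds the predicted probability that each case is a 1.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip so that log(0) never occurs for a probability of exactly 0 or 1.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Same hard predictions at a 0.5 threshold, very different probabilities:
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
```

The further a predicted probability sits from the correct answer, the larger the penalty, so `confident_wrong` comes out well above `confident_right`.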
Advanced Options
Partitioning Design
Random:
20% of data put in "lock-box" to be tested right before system goes to production
Stratified
Works harder to maintain the same distribution of target values
Once lock-box has been filled, remaining cases are split into n folds
More folds = drawbacks
Less reliable
DataRobot makes consequential decisions based on first validation fold
5 is the most common number of folds
Distribution between various samples split into 5 folds
Cross-validation scores are the average of the scores across the models built
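The partitioning described above can be sketched roughly as follows: a 20% lock-box (holdout) is set aside first, the remaining cases are dealt into 5 folds, and the split is done per target class so each partition keeps a similar target distribution. This is an illustration under stated assumptions, not DataRobot's implementation; `stratified_partition` and `cross_validation_score` are hypothetical names.

```python
import random
from collections import defaultdict

def stratified_partition(targets, n_folds=5, holdout_frac=0.2, seed=0):
    """Label each case 'holdout' or with a fold index 0..n_folds-1."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, t in enumerate(targets):
        by_class[t].append(i)

    assignment = [None] * len(targets)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Fill the lock-box first, class by class.
        n_holdout = round(len(indices) * holdout_frac)
        for i in indices[:n_holdout]:
            assignment[i] = "holdout"
        # Deal the remaining cases of this class round-robin into the folds.
        for k, i in enumerate(indices[n_holdout:]):
            assignment[i] = k % n_folds
    return assignment

def cross_validation_score(fold_scores):
    """Cross-validation score as the plain average of the per-fold scores."""
    return sum(fold_scores) / len(fold_scores)

targets = [1] * 20 + [0] * 80   # 20% positive cases
assignment = stratified_partition(targets)
```

With this toy data, the lock-box and every fold end up with roughly the same 20% share of positive cases.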
Partition Feature
Method for defining exactly which cases are used in different folds
User must do own random or semi random assignment of cases
Group Approach
Allows for specification of group membership feature
DataRobot makes decisions about where a case is to be partitioned but always keeps each group together
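A minimal sketch of the group idea, assuming a simple greedy strategy (largest groups first, each placed into the currently smallest fold). How DataRobot actually assigns groups is not stated in these notes; `group_partition` and the sample `customers` list are hypothetical.

```python
def group_partition(groups, n_folds=5):
    """Assign a fold to every case so that cases sharing a group value stay together."""
    sizes = {}
    for g in groups:
        sizes[g] = sizes.get(g, 0) + 1

    fold_sizes = [0] * n_folds
    fold_of_group = {}
    # Place the largest groups first, each into the fold with the fewest cases so far.
    for g in sorted(sizes, key=lambda g: (-sizes[g], str(g))):
        k = fold_sizes.index(min(fold_sizes))
        fold_of_group[g] = k
        fold_sizes[k] += sizes[g]

    return [fold_of_group[g] for g in groups]

# Group membership feature: one customer ID per case.
customers = ["a", "b", "a", "c", "b", "d", "a", "e"]
folds = group_partition(customers, n_folds=3)
```

Every case for customer "a" lands in the same fold, so no single customer appears in both the training and validation data of a given model.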
Date/Time option deals with critical evaluation issue
Making sure all validation cases occur at a time after the time of the training cases
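The date/time constraint can be illustrated with a simple cutoff split. This is a sketch only; `time_split`, the cutoff date, and the sample cases are assumptions made for the example.

```python
from datetime import date

def time_split(cases, cutoff):
    """Split (timestamp, payload) pairs: training before the cutoff, validation from it onward."""
    training = [c for c in cases if c[0] < cutoff]
    validation = [c for c in cases if c[0] >= cutoff]
    return training, validation

cases = [
    (date(2020, 1, 5), "a"),
    (date(2020, 3, 1), "b"),
    (date(2020, 2, 10), "c"),
    (date(2020, 4, 20), "d"),
]
training, validation = time_split(cases, cutoff=date(2020, 3, 1))
```

Because every validation case is later than every training case, the model is never evaluated on data from before the period it learned from.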
Starting Analytical Process
Autopilot
Quick
Manual
Steps
Step 1: Setting Target Feature
transfers user's target feature decision
Step 2: Creating CV and Holdout partitions
Uses decisions we made in advanced settings
Step 3: Characterizing target variable
DataRobot will save the distribution of the target to the analysis system
Used later in decisions about which models to run
Step 4: Loading and preparing data
Relevant:
dataset is large (over 500)
All the initial evaluations before step will be conducted
Step 5: Saving target and partitioning info
The actual partitions (cross-validation folds and holdout set) are stored in a separate file on disk
Step 6: Importance scores are calculated
Step 7: Calculating list of models
Info from steps 3-6 is used to determine which blueprints to run in the Autopilot process
Importance
Indicates the relative importance of a particular feature when examined against the target independently of all other features
Model Selection Process
Running of almost every effective algorithm invented for machine learning
Assigned worker
Dedicated computer
Algorithms requiring more processing power will be assigned many more CPUs