Ch. 15 Build Candidate Models
Candidate Models: Constructing the Training Data
This Training Data will be used to determine which algorithm best fits the problem
Select Target Variable - what you're trying to predict
focuses on the delta between the prediction and the correct answer - that delta counts as error
the model optimizes (minimizes) this error
although you can customize, per client, how you define optimization
Methods for Data Selection: Random vs. Stratified
Random: takes a random 20% for a holdout sample
Stratified: preserves the target's distribution in the holdout sample
holdout sample = fixed portion of the data held back from training, used to estimate how the model will do on real-world predictions
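The two selection methods above can be sketched in plain Python (the function names `random_holdout` / `stratified_holdout` are my own illustration, not a real tool's API):

```python
import random
from collections import Counter

def random_holdout(rows, frac=0.2, seed=0):
    """Randomly set aside `frac` of the rows as a holdout sample."""
    rng = random.Random(seed)
    idx = set(rng.sample(range(len(rows)), int(len(rows) * frac)))
    holdout = [r for i, r in enumerate(rows) if i in idx]
    train = [r for i, r in enumerate(rows) if i not in idx]
    return train, holdout

def stratified_holdout(rows, label, frac=0.2, seed=0):
    """Sample `frac` of the rows within each class, so the holdout
    keeps the same target distribution as the full data set."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(label(r), []).append(r)
    holdout, train = [], []
    for members in by_class.values():
        rng.shuffle(members)
        k = int(len(members) * frac)
        holdout.extend(members[:k])
        train.extend(members[k:])
    return train, holdout
```

With an 80/20 class mix, the stratified holdout keeps that same 80/20 mix, while the random holdout only approximates it.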
After the holdout sample is set aside, we split the remainder into folds
The folds (and the 1st fold especially) have a domino effect on the model (5 is the golden number, ex [10k - 2k] = 8k / 5 = 1.6k per fold)
4 of the 5 folds are used to TRAIN the model, 1 of 5 is used to VALIDATE the model
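The arithmetic in the example above ([10k - 2k] = 8k / 5 = 1.6k) can be checked with a tiny helper (a sketch of my own; it assumes the fold count divides evenly):

```python
def fold_sizes(n_rows, holdout_frac=0.2, n_folds=5):
    """Return (holdout, validation, training) row counts:
    set aside the holdout, split the rest into equal folds,
    validate on 1 fold and train on the other n_folds - 1."""
    holdout = int(n_rows * holdout_frac)
    per_fold = (n_rows - holdout) // n_folds
    return holdout, per_fold, per_fold * (n_folds - 1)

# 10,000 rows -> 2,000 holdout; 8,000 / 5 = 1,600 to validate;
# 4 folds x 1,600 = 6,400 to train
fold_sizes(10_000)
```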
TARGET LEAK can happen when partitioning // cross-validating data: information unavailable at prediction time sneaks into the training features. This will cause inaccurate (overly optimistic) predictions (ex: Weatherman)
Quick mode: automatically builds models to predict behavior
LogLoss is the default metric for prediction
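LogLoss itself is easy to compute by hand: the mean negative log-likelihood of the predicted probabilities (a from-scratch sketch; libraries provide equivalents):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary predictions.
    Confident wrong answers are punished far more heavily
    than confident right ones."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

log_loss([1, 0], [0.9, 0.1])  # confident and right: ~0.105
log_loss([1, 0], [0.1, 0.9])  # confident and wrong: ~2.303
```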
Informative Features: excludes features like exact duplicates or those with too few distinct values
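A minimal sketch of that kind of filter (my own illustration, not the product's actual logic): drop exact-duplicate columns and columns with too few distinct values:

```python
def informative_features(columns, min_distinct=2):
    """Keep only columns that can help the model: no exact
    duplicates of an earlier column, no (near-)constants."""
    kept, seen = [], set()
    for name, values in columns.items():
        key = tuple(values)
        if key in seen:                      # duplicate column
            continue
        if len(set(values)) < min_distinct:  # too few values
            continue
        seen.add(key)
        kept.append(name)
    return kept

cols = {"a": [1, 2, 3], "b": [1, 2, 3], "c": [7, 7, 7], "d": [0, 1, 0]}
informative_features(cols)  # "b" is a dupe of "a", "c" is constant
```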
Steps for Creating Predictive Model:
- Setting Target Variable (market neutral portfolio in CQA)
- Creating CV and Holdout Partitions (~Parts - Deciles or Quintiles in CQA)
- Characterizing Target variable (optimizing max return and min risk in CQA)
- loading data set and prepping data (FNCE and utilities & factor decile choice in CQA)
- saving target and partitioning information
- Analyzing features (Backtesting results in CQA)
- Calculating List of Models (Multiple backtests in CQA)
- Calculating List of Models:
Sorts models by HOW WELL THEY PREDICT, scored via R^2 / linear regression (factor correlation to risk and return in CQA)
Even works on Non-Linear relationships
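The R^2 score mentioned above — the share of the target's variance the model explains — can be computed from scratch (a sketch; 1.0 is a perfect fit, 0.0 is no better than predicting the mean):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

r_squared([1, 2, 3], [1, 2, 3])  # perfect fit -> 1.0
r_squared([1, 2, 3], [2, 2, 2])  # just predicting the mean -> 0.0
```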
MODEL SELECTION PROCESS
Most ML Algos reside in Amazon's cloud
The reason is that we need MULTIPLE computers (RAM/compute) to be running these algos
TOURNAMENTS of ALGOS // MODELS (Spearman correlation in CQA)
Cross-Validating: testing the model on data it was not trained on, to determine IF it can be generalized to an INDEPENDENT SET OF DATA
Algos create predictive models, and these models are then placed in ranking order (least error = best)
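The ranking step can be sketched as a leaderboard sorted by mean cross-validation error (the model names and scores below are made up for illustration):

```python
def rank_models(cv_errors):
    """Rank candidate models by mean cross-validation error
    (least error = best, as in the tournament of models)."""
    leaderboard = sorted(cv_errors.items(),
                         key=lambda kv: sum(kv[1]) / len(kv[1]))
    return [name for name, _ in leaderboard]

scores = {  # per-fold errors for three hypothetical models
    "logistic_regression": [0.52, 0.49, 0.51, 0.50],
    "random_forest":       [0.41, 0.43, 0.40, 0.42],
    "gradient_boosting":   [0.38, 0.39, 0.37, 0.40],
}
rank_models(scores)  # best (lowest mean error) first
```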
Random forest classifier: helps with training data as it contains 1000's of decision trees
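A real random forest grows full decision trees on bootstrap samples and majority-votes their predictions; the toy sketch below (all names my own) uses one-split stumps instead of full trees, just to show the bootstrap-and-vote idea:

```python
import random
from collections import Counter

def fit_stump(rows):
    """Fit a one-split 'tree': the (feature, threshold) pair whose
    left/right majority vote misclassifies the fewest rows."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xi, y in rows if xi[f] <= t]
            right = [y for xi, y in rows if xi[f] > t]
            if not left or not right:
                continue
            err = (len(left) - Counter(left).most_common(1)[0][1] +
                   len(right) - Counter(right).most_common(1)[0][1])
            if best is None or err < best[0]:
                best = (err, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # degenerate sample: always predict the majority class
        majority = Counter(y for _, y in rows).most_common(1)[0][0]
        return lambda x: majority
    _, f, t, left_cls, right_cls = best
    return lambda x: left_cls if x[f] <= t else right_cls

def fit_forest(rows, n_trees=25, seed=0):
    """Toy random forest: stumps on bootstrap samples, majority vote."""
    rng = random.Random(seed)
    trees = [fit_stump([rng.choice(rows) for _ in rows])
             for _ in range(n_trees)]
    return lambda x: Counter(tree(x) for tree in trees).most_common(1)[0][0]

# two well-separated classes on one feature
data = [((0.0,), 0), ((1.0,), 0), ((5.0,), 1), ((6.0,), 1)]
model = fit_forest(data)
```

Each bootstrap sample trains a slightly different tree; averaging many of them is what makes the forest robust.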
Round 1: 16% of the data
Round 2: 32%
Round 3: 64%
Round 4: 64%, cross-validated (CV)
Round 5: Blending
Blending: cross-validating and optimizing combinations of models against each other to find the best predictive result
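The simplest blender just averages the candidate models' probability forecasts (the numbers below are illustrative; real blenders also learn the weights on validation data):

```python
def blend(predictions, weights=None):
    """Blend several models' probability forecasts by
    (weighted) averaging -- the simplest form of blending."""
    n_models = len(predictions)
    weights = weights or [1 / n_models] * n_models
    return [sum(w * p[i] for w, p in zip(weights, predictions))
            for i in range(len(predictions[0]))]

model_a = [0.9, 0.2, 0.7]
model_b = [0.7, 0.4, 0.9]
blend([model_a, model_b])  # equal weights -> roughly [0.8, 0.3, 0.8]
```

If the two models make different mistakes, the blend's error is typically lower than either model's alone, which is why the blending round often wins the tournament.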