Ch. 15 Build Candidate Models
Candidate Models: Constructing the Training Data
This Training Data will be used to determine which algorithm best fits the problem
Select Target Variable - what you're trying to predict
focuses on the delta between the prediction and the correct answer - that delta counts as error
the model optimizes (minimizes) this error
although you can customize, per client, how you define optimization
Methods for Data Selection: Random vs. Stratified
Random: takes a random 20% for a holdout sample
Stratified: preserves the target's distribution in the holdout sample
holdout sample = fixed portion of the data held back from training, used to estimate how the model will do on real-world predictions
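The two selection methods above can be sketched in plain Python (the function names `random_holdout` / `stratified_holdout` are my own illustration, not a real tool's API):

```python
import random
from collections import Counter

def random_holdout(rows, frac=0.2, seed=0):
    """Randomly set aside `frac` of the rows as a holdout sample."""
    rng = random.Random(seed)
    idx = set(rng.sample(range(len(rows)), int(len(rows) * frac)))
    holdout = [r for i, r in enumerate(rows) if i in idx]
    train = [r for i, r in enumerate(rows) if i not in idx]
    return train, holdout

def stratified_holdout(rows, label, frac=0.2, seed=0):
    """Sample `frac` of the rows within each class, so the holdout
    keeps the same target distribution as the full data set."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(label(r), []).append(r)
    holdout, train = [], []
    for members in by_class.values():
        rng.shuffle(members)
        k = int(len(members) * frac)
        holdout.extend(members[:k])
        train.extend(members[k:])
    return train, holdout
```

With an 80/20 class mix, the stratified holdout keeps that same 80/20 mix, while the random holdout only approximates it.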
After the holdout sample is set aside, we split the remainder into folds
The folds (and the 1st fold especially) have a domino effect on the model (5 is the golden number, ex [10k - 2k] = 8k / 5 = 1.6k per fold)
4 of the 5 folds are used to TRAIN the model, 1 of 5 is used to VALIDATE the model
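The arithmetic in the example above ([10k - 2k] = 8k / 5 = 1.6k) can be checked with a tiny helper (a sketch of my own; it assumes the fold count divides evenly):

```python
def fold_sizes(n_rows, holdout_frac=0.2, n_folds=5):
    """Return (holdout, validation, training) row counts:
    set aside the holdout, split the rest into equal folds,
    validate on 1 fold and train on the other n_folds - 1."""
    holdout = int(n_rows * holdout_frac)
    per_fold = (n_rows - holdout) // n_folds
    return holdout, per_fold, per_fold * (n_folds - 1)

# 10,000 rows -> 2,000 holdout; 8,000 / 5 = 1,600 to validate;
# 4 folds x 1,600 = 6,400 to train
fold_sizes(10_000)
```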
TARGET LEAK can happen when partitioning // cross-validating data: information unavailable at prediction time sneaks into the training features. This will cause inaccurate (overly optimistic) predictions (ex: Weatherman)
Quick mode: automatically builds models to predict behavior
LogLoss is the default metric for prediction
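LogLoss itself is easy to compute by hand: the mean negative log-likelihood of the predicted probabilities (a from-scratch sketch; libraries provide equivalents):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary predictions.
    Confident wrong answers are punished far more heavily
    than confident right ones."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

log_loss([1, 0], [0.9, 0.1])  # confident and right: ~0.105
log_loss([1, 0], [0.1, 0.9])  # confident and wrong: ~2.303
```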
Informative Features: excludes features like exact duplicates or those with too few distinct values
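A minimal sketch of that kind of filter (my own illustration, not the product's actual logic): drop exact-duplicate columns and columns with too few distinct values:

```python
def informative_features(columns, min_distinct=2):
    """Keep only columns that can help the model: no exact
    duplicates of an earlier column, no (near-)constants."""
    kept, seen = [], set()
    for name, values in columns.items():
        key = tuple(values)
        if key in seen:                      # duplicate column
            continue
        if len(set(values)) < min_distinct:  # too few values
            continue
        seen.add(key)
        kept.append(name)
    return kept

cols = {"a": [1, 2, 3], "b": [1, 2, 3], "c": [7, 7, 7], "d": [0, 1, 0]}
informative_features(cols)  # "b" is a dupe of "a", "c" is constant
```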
Steps for Creating Predictive Model:
- Setting Target Variable (market neutral portfolio in CQA)
- Creating CV and Holdout Partitions (~Parts - Deciles or Quintiles in CQA)
- Characterizing Target variable (optimizing max return and min risk in CQA)
- loading data set and prepping data (FNCE and utilities & factor decile choice in CQA)
- saving target and partitioning information
- Analyzing features (Backtesting results in CQA)
- Calculating List of Models (Multiple backtests in CQA)
- Calculating List of Models:
Sorts models by HOW WELL THEY PREDICT, scored via R^2 / linear regression (factor correlation to risk and return in CQA)
Even works on Non-Linear relationships
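The R^2 score mentioned above — the share of the target's variance the model explains — can be computed from scratch (a sketch; 1.0 is a perfect fit, 0.0 is no better than predicting the mean):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

r_squared([1, 2, 3], [1, 2, 3])  # perfect fit -> 1.0
r_squared([1, 2, 3], [2, 2, 2])  # just predicting the mean -> 0.0
```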
MODEL SELECTION PROCESS
Most ML Algos reside in Amazon's cloud
The reason is that we need MULTIPLE computers (RAM/compute) to be running these algos
TOURNAMENTS of ALGOS // MODELS (Spearman correlation in CQA)
Cross-Validating: testing the model on data it was not trained on, to determine IF it can be generalized to an INDEPENDENT SET OF DATA
Algos create predictive models, and these models are then placed in ranking order (least error = best)
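The ranking step can be sketched as a leaderboard sorted by mean cross-validation error (the model names and scores below are made up for illustration):

```python
def rank_models(cv_errors):
    """Rank candidate models by mean cross-validation error
    (least error = best, as in the tournament of models)."""
    leaderboard = sorted(cv_errors.items(),
                         key=lambda kv: sum(kv[1]) / len(kv[1]))
    return [name for name, _ in leaderboard]

scores = {  # per-fold errors for three hypothetical models
    "logistic_regression": [0.52, 0.49, 0.51, 0.50],
    "random_forest":       [0.41, 0.43, 0.40, 0.42],
    "gradient_boosting":   [0.38, 0.39, 0.37, 0.40],
}
rank_models(scores)  # best (lowest mean error) first
```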
Random forest classifier: helps with training data as it contains 1000's of decision trees
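A real random forest grows full decision trees on bootstrap samples and majority-votes their predictions; the toy sketch below (all names my own) uses one-split stumps instead of full trees, just to show the bootstrap-and-vote idea:

```python
import random
from collections import Counter

def fit_stump(rows):
    """Fit a one-split 'tree': the (feature, threshold) pair whose
    left/right majority vote misclassifies the fewest rows."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xi, y in rows if xi[f] <= t]
            right = [y for xi, y in rows if xi[f] > t]
            if not left or not right:
                continue
            err = (len(left) - Counter(left).most_common(1)[0][1] +
                   len(right) - Counter(right).most_common(1)[0][1])
            if best is None or err < best[0]:
                best = (err, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # degenerate sample: always predict the majority class
        majority = Counter(y for _, y in rows).most_common(1)[0][0]
        return lambda x: majority
    _, f, t, left_cls, right_cls = best
    return lambda x: left_cls if x[f] <= t else right_cls

def fit_forest(rows, n_trees=25, seed=0):
    """Toy random forest: stumps on bootstrap samples, majority vote."""
    rng = random.Random(seed)
    trees = [fit_stump([rng.choice(rows) for _ in rows])
             for _ in range(n_trees)]
    return lambda x: Counter(tree(x) for tree in trees).most_common(1)[0][0]

# two well-separated classes on one feature
data = [((0.0,), 0), ((1.0,), 0), ((5.0,), 1), ((6.0,), 1)]
model = fit_forest(data)
```

Each bootstrap sample trains a slightly different tree; averaging many of them is what makes the forest robust.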
Round 1: 16% of the data
Round 2: 32%
Round 3: 64%
Round 4: 64%, cross-validated (CV)
Round 5: Blending
Blending: cross-validating and optimizing combinations of models against each other to find the best predictive result
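The simplest blender just averages the candidate models' probability forecasts (the numbers below are illustrative; real blenders also learn the weights on validation data):

```python
def blend(predictions, weights=None):
    """Blend several models' probability forecasts by
    (weighted) averaging -- the simplest form of blending."""
    n_models = len(predictions)
    weights = weights or [1 / n_models] * n_models
    return [sum(w * p[i] for w, p in zip(weights, predictions))
            for i in range(len(predictions[0]))]

model_a = [0.9, 0.2, 0.7]
model_b = [0.7, 0.4, 0.9]
blend([model_a, model_b])  # equal weights -> roughly [0.8, 0.3, 0.8]
```

If the two models make different mistakes, the blend's error is typically lower than either model's alone, which is why the blending round often wins the tournament.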