Overfitting and Its Avoidance
Generalization
Property of a modeling process
Model applies to data that was not used to build the model (data beyond the historical training set)
Training data may not reflect real data
Overfitting
Tendency in data mining to tailor models to the training data at the expense of generalization
All data mining procedures have a tendency to overfit
Examples
Holdout Data
Labeled data (target value known) set aside and not used to build the model
Hide the target values and have the model predict them
Estimate generalization performance by comparing the predicted values with the hidden true values
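A minimal sketch of holdout evaluation, using only the Python standard library. The synthetic data (label is x > 0.5 with 20% of labels flipped), the 1-nearest-neighbor model, and all function names are assumptions chosen for illustration. Because the model memorizes its training set, it scores perfectly on the data it was built from, while the held-out data gives a more honest estimate.

```python
import random

def make_data(n, rng, noise=0.2):
    """Synthetic data (an assumption): label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point nearest to x (pure memorization)."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, examples):
    """Fraction of examples whose hidden label matches the prediction."""
    hits = sum(predict_1nn(train, x) == label for x, label in examples)
    return hits / len(examples)

rng = random.Random(0)
data = make_data(300, rng)
train, holdout = data[:200], data[200:]  # the last 100 labeled examples are held out

train_acc = accuracy(train, train)      # evaluated on data the model has already seen
holdout_acc = accuracy(train, holdout)  # evaluated on data the model has never seen
print(f"training accuracy: {train_acc:.2f}")
print(f"holdout accuracy:  {holdout_acc:.2f}")
```

The gap between the two printed numbers is the symptom of overfitting that holdout testing is meant to expose.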
Cross-Validation
Repeats holdout testing over multiple splits (folds), giving a mean and variance of the estimated performance
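Cross-validation can be sketched as follows, assuming 5 folds, synthetic one-dimensional data, and a simple nearest-centroid classifier; none of these choices come from the notes. Each fold serves once as the holdout set, and the per-fold accuracies are summarized by their mean and variance.

```python
import random
import statistics

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k disjoint, nearly equal folds."""
    return [list(range(i, n, k)) for i in range(k)]

def fit_centroids(train):
    """Nearest-centroid model: the mean x of each class."""
    pos = [x for x, y in train if y]
    neg = [x for x, y in train if not y]
    return statistics.mean(pos), statistics.mean(neg)

def accuracy(model, examples):
    """Fraction of examples assigned to the class with the closer centroid."""
    pos_c, neg_c = model
    hits = sum((abs(x - pos_c) < abs(x - neg_c)) == y for x, y in examples)
    return hits / len(examples)

# Synthetic data (an assumption): two overlapping Gaussian classes.
rng = random.Random(1)
data = []
for _ in range(200):
    y = rng.random() < 0.5
    x = rng.gauss(1.0 if y else 0.0, 0.7)
    data.append((x, y))

k = 5
scores = []
for fold in k_fold_indices(len(data), k):
    held_out = set(fold)
    test = [data[i] for i in fold]
    train = [data[i] for i in range(len(data)) if i not in held_out]
    scores.append(accuracy(fit_centroids(train), test))

mean_acc = statistics.mean(scores)
var_acc = statistics.variance(scores)
print(f"fold accuracies: {[round(s, 2) for s in scores]}")
print(f"mean={mean_acc:.3f}  variance={var_acc:.5f}")
```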
Fitting Graph
Shows model accuracy as a function of model complexity, on both training and holdout data
Overfitting in Tree
Compares accuracy against the number of tree nodes; past a certain size, holdout accuracy declines while training accuracy keeps rising
Also applies to mathematical functions: f(x) = w0 + w1*x1 + w2*x2 + ... (each added attribute increases complexity)
Also applies to linear functions
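The numbers behind a fitting graph can be produced with a sketch like the one below. As a stand-in for a tree, it assumes a piecewise-constant classifier over equal-width bins, where the number of bins plays the role of the number of leaves; the synthetic data and all names are likewise illustrative assumptions.

```python
import random

def make_data(n, rng, noise=0.2):
    """Synthetic data (an assumption): label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

def fit_bins(train, n_bins):
    """Majority label per equal-width bin on [0, 1); ties and empty bins use the overall majority."""
    counts = [[0, 0] for _ in range(n_bins)]  # [negatives, positives] per bin
    for x, y in train:
        counts[min(int(x * n_bins), n_bins - 1)][int(y)] += 1
    default = 2 * sum(pos for _, pos in counts) >= len(train)
    return [pos > neg if pos != neg else default for neg, pos in counts]

def accuracy(model, examples):
    """Fraction of examples whose label matches the prediction of their bin."""
    n_bins = len(model)
    hits = sum(model[min(int(x * n_bins), n_bins - 1)] == y for x, y in examples)
    return hits / len(examples)

rng = random.Random(0)
train, holdout = make_data(200, rng), make_data(200, rng)

complexities = [1, 2, 4, 8, 16, 32, 64, 128]  # each level refines the previous one
train_accs, holdout_accs = [], []
for n_bins in complexities:
    model = fit_bins(train, n_bins)
    train_accs.append(accuracy(model, train))
    holdout_accs.append(accuracy(model, holdout))
    print(f"{n_bins:4d} bins  train={train_accs[-1]:.2f}  holdout={holdout_accs[-1]:.2f}")
```

Because each bin count refines the previous one, training accuracy can only stay level or rise as complexity grows. Holdout accuracy carries no such guarantee, which is what the fitting graph makes visible.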
Learning Curves
Generalization performance generally improves as more training data is used
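A learning curve can be sketched by holding model complexity fixed and varying only the amount of training data. The example below assumes a 16-bin piecewise-constant classifier and synthetic data with 20% label noise; both are illustrative choices rather than anything specified in the notes.

```python
import random

def make_data(n, rng, noise=0.2):
    """Synthetic data (an assumption): label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

def fit_bins(train, n_bins=16):
    """Majority label per equal-width bin; ties and empty bins use the overall majority."""
    counts = [[0, 0] for _ in range(n_bins)]  # [negatives, positives] per bin
    for x, y in train:
        counts[min(int(x * n_bins), n_bins - 1)][int(y)] += 1
    default = 2 * sum(pos for _, pos in counts) >= len(train)
    return [pos > neg if pos != neg else default for neg, pos in counts]

def accuracy(model, examples):
    """Fraction of examples whose label matches the prediction of their bin."""
    n_bins = len(model)
    hits = sum(model[min(int(x * n_bins), n_bins - 1)] == y for x, y in examples)
    return hits / len(examples)

rng = random.Random(2)
pool = make_data(2000, rng)     # training pool; prefixes of it are used below
holdout = make_data(1000, rng)  # fixed holdout set shared by every training size

sizes = [10, 20, 50, 100, 500, 2000]
curve = []
for n in sizes:
    model = fit_bins(pool[:n])
    curve.append(accuracy(model, holdout))
    print(f"{n:5d} training examples  holdout accuracy={curve[-1]:.3f}")
```

Unlike a fitting graph, the horizontal axis here is the amount of training data, not model complexity.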
Overfitting Avoidance & Complexity Control
Avoiding overfitting in trees
Stop growing the tree before it becomes too complex
Specify a minimum number of instances per leaf
Grow the tree until it is too large, then prune it back
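The minimum-instances stopping rule can be illustrated with a small tree learner on one-dimensional data. This is a sketch under stated assumptions: the greedy split criterion (fewest training errors), the synthetic data, and the parameter name `min_leaf` are all hypothetical, and pruning is not shown.

```python
import random

def errors(labels):
    """Training errors made by predicting the majority label for these instances."""
    positives = sum(labels)
    return min(positives, len(labels) - positives)

def grow(points, min_leaf):
    """Grow a tree on 1-D points [(x, label)], keeping at least `min_leaf` instances per leaf."""
    points = sorted(points)
    labels = [y for _, y in points]
    leaf = {"leaf": 2 * sum(labels) >= len(labels)}
    if errors(labels) == 0:
        return leaf  # pure node: nothing left to fit
    best = None
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # identical x values cannot be separated
        cost = errors(labels[:i]) + errors(labels[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return leaf  # any split would leave a child smaller than min_leaf
    i = best[1]
    return {"threshold": points[i][0],
            "left": grow(points[:i], min_leaf),
            "right": grow(points[i:], min_leaf)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["leaf"]

def count_leaves(tree):
    if "leaf" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])

def accuracy(tree, examples):
    return sum(predict(tree, x) == y for x, y in examples) / len(examples)

# Synthetic data (an assumption): label is x > 0.5 with 20% of labels flipped.
rng = random.Random(4)
data = [(x, (x > 0.5) != (rng.random() < 0.2)) for x in (rng.random() for _ in range(300))]
train, holdout = data[:150], data[150:]

results = {}
for min_leaf in [1, 5, 20, 50]:
    tree = grow(train, min_leaf)
    results[min_leaf] = (count_leaves(tree), accuracy(tree, train), accuracy(tree, holdout))
    leaves, tr, ho = results[min_leaf]
    print(f"min_leaf={min_leaf:2d}  leaves={leaves:3d}  train={tr:.2f}  holdout={ho:.2f}")
```

With `min_leaf=1` the tree keeps splitting until every leaf is pure, so it fits the training data perfectly. Larger values cap the number of leaves and so limit how closely the tree can follow noise.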
General Methods
Understand importance/relevance of initial data set
Nested holdout testing
Parameter Optimization
Regularization: optimize a combination of fit to the data and simplicity of the model, penalizing complexity
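A minimal sketch combining these general methods, assuming a one-weight linear model f(x) = w*x with an L2 penalty (ridge regression), for which the minimizer has a simple closed form. The penalty strength `lam`, the candidate values, and the synthetic data are hypothetical. The penalty strength is chosen on a validation split carved out of the training data (nested holdout), and the final holdout set is used only once at the end.

```python
import random

def fit_ridge(train, lam):
    """Minimize sum((y - w*x)**2) + lam * w**2 for the one-weight model f(x) = w*x."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)

def mse(w, examples):
    """Mean squared error of f(x) = w*x on the given examples."""
    return sum((y - w * x) ** 2 for x, y in examples) / len(examples)

def make_data(n, rng):
    """Synthetic data (an assumption): y = 2x plus Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]

rng = random.Random(3)
train_all = make_data(30, rng)
holdout = make_data(200, rng)  # never used for fitting or for choosing lam

# Nested holdout: split the training data again to choose the complexity parameter.
sub_train, validation = train_all[:20], train_all[20:]
candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = min(candidates, key=lambda lam: mse(fit_ridge(sub_train, lam), validation))

# Refit on all the training data with the chosen parameter, then evaluate once.
final_w = fit_ridge(train_all, best_lam)
print(f"chosen lam={best_lam}  w={final_w:.3f}  holdout MSE={mse(final_w, holdout):.3f}")

for lam in candidates:
    print(f"lam={lam:5.1f}  w={fit_ridge(sub_train, lam):.3f}")
```

Larger values of `lam` pull the weight toward zero, trading some fit to the training data for a simpler model.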