ML Model development
- Training and test dataset
Test Set
- used only for the final evaluation and to estimate the generalization error
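A minimal sketch of such a split, assuming scikit-learn and dummy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data standing in for real features/labels.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# Hold out 20% as the test set, touched only for the final generalization estimate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Carve a validation set out of the remaining data for model selection/tuning.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
```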
Other techniques to achieve a well-fitted model:
- Early-stopping
- Regularization techniques
- Ensemble methods
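A hedged sketch of what these three look like in scikit-learn; all parameter values are illustrative:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

# Early stopping: halt training once the validation score stops improving.
sgd_early = SGDClassifier(early_stopping=True, validation_fraction=0.1, n_iter_no_change=5)

# Regularization: penalize large weights (alpha sets the L2 strength).
sgd_l2 = SGDClassifier(penalty="l2", alpha=1e-4)

# Ensemble method: average many decision trees to reduce variance.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
```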
- Model Evaluation
Report the performance of the classifier
Regression Metrics
As with most things in ML, there is no single correct metric. Best practice is to monitor a few and choose the most meaningful one based on the dataset and the business problem.
- Mean Squared Error :check:
- Root mean squared error
- popular, since it reports the error in the target's actual units, unlike MSE, where they are squared
- mean absolute error
- more robust to outliers than MSE, since it does not penalise large errors as heavily
- R squared
- compares model performance against a baseline model that always predicts the mean :check:
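A small scikit-learn sketch computing all four metrics on hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true vs. predicted values, purely for illustration.
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = mean_squared_error(y_true, y_pred)   # in squared units
rmse = np.sqrt(mse)                        # back in the target's units
mae = mean_absolute_error(y_true, y_pred)  # less sensitive to outliers
r2 = r2_score(y_true, y_pred)              # vs. always predicting the mean
print(mse, rmse, mae, r2)
```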
Based on confusion matrix:
- Accuracy
:warning: Accuracy paradox: accuracy can look high even when the minority positive class is poorly predicted
- Precision -> exactness
- Recall -> completeness
- F1-Score
- balances both precision and recall
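A sketch illustrating these metrics and the accuracy paradox on a hypothetical imbalanced sample (scikit-learn assumed):

```python
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Toy labels with a minority positive class; the model misses one of two positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))   # 0.9: looks high despite a missed positive
print(precision_score(y_true, y_pred))  # exactness: predicted positives that are correct
print(recall_score(y_true, y_pred))     # completeness: actual positives that were found
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```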
Underfitting Indicators:
- Low performance scores on train, val, test data.
Overfitting Indicators:
- Model performs well on training data but not on validation and/or test data
Remediations
- Train and validate on different datasets (e.g. via cross-validation; see the sketch below)
- Select features to train the model
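A sketch of both remediations, assuming scikit-learn and a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# An unconstrained tree can memorize the training data.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("train accuracy:", tree.score(X, y))                       # close to 1.0
print("cv accuracy:", cross_val_score(tree, X, y, cv=5).mean())  # noticeably lower -> overfitting

# Feature selection: keep only the k most informative features.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
```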
2.1 EDA
Geolocation data libraries
- Folium
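A minimal Folium sketch (the coordinates are illustrative, not from these notes):

```python
import folium

# Build an interactive map centered on a hypothetical point (New York City).
m = folium.Map(location=[40.7128, -74.0060], zoom_start=12)
folium.Marker([40.7128, -74.0060], popup="Sample location").add_to(m)
m.save("map.html")  # open the saved file in a browser to explore
```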
Resources
- Experimental Design and Analysis - Howard J. Seltman
Objective: MLU
- Discover Patterns
- Spot anomalies
- Look for insights for modelling choices
Patterns
- Identify feature pairs that are highly correlated, remove one :red_cross:
- Identify features that are highly correlated with the target, keep them :check:
:warning: High correlated feature
- Linear/logistic regression models may degrade in performance when trained on highly correlated features (e.g. rooms and sqft)
- Decision trees are largely immune to this problem: each split considers one feature at a time, so a redundant correlated feature is simply chosen less often instead of destabilizing the model
- :star: Linear/logistic regression models may improve performance when features are highly correlated with the target
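A pandas sketch of both correlation checks on a hypothetical housing-style frame:

```python
import pandas as pd

# Toy frame: `sqft` deliberately tracks `rooms`; `price` is the target.
df = pd.DataFrame({
    "rooms": [2, 3, 3, 4, 5, 5],
    "sqft":  [700, 1000, 1100, 1400, 1800, 1900],
    "age":   [30, 12, 8, 20, 5, 3],
    "price": [150, 220, 240, 300, 400, 420],
})

corr = df.corr()
# Feature-target correlations: keep features with high absolute values.
print(corr["price"].sort_values(ascending=False))
# Feature-feature correlation: if |corr| is very high (e.g. > 0.9), consider dropping one.
print(corr.loc["rooms", "sqft"])
```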
Anomalies:
- Missing Data
- Class Imbalances
- Outlier Detection
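A pandas sketch of all three checks on a toy frame; the IQR rule used here is one common outlier heuristic, not the only option:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feature": [1.0, 2.0, 2.5, np.nan, 3.0, 120.0],  # one missing value, one extreme value
    "label":   [0, 0, 0, 0, 0, 1],
})

print(df.isna().sum())             # missing data per column
print(df["label"].value_counts())  # class balance check

# Outlier detection via the IQR rule.
q1, q3 = df["feature"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["feature"] < q1 - 1.5 * iqr) | (df["feature"] > q3 + 1.5 * iqr)]
print(outliers)
```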
2.2 Feature Engineering
Look out for other important features that can be extracted from the given data, which
- provide more insights
- help in better modeling
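A hypothetical pandas example of this: deriving new features from a raw timestamp column (the column name is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"pickup_time": pd.to_datetime(["2021-01-04 08:30", "2021-01-09 23:10"])})

df["hour"] = df["pickup_time"].dt.hour            # time of day may drive demand
df["dayofweek"] = df["pickup_time"].dt.dayofweek  # weekday vs. weekend behavior
df["is_weekend"] = df["dayofweek"] >= 5
print(df)
```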
3.1 Prepare Data
Handle class imbalances
- use classification metrics like precision, recall, f1 score instead of accuracy
- downsampling: remove random dominant class records
- upsampling: duplicate random minority class records
- data augmentation/generation: create similar, new records as of minority class
- sample weights in the cost function: higher weight for rare classes
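A sketch of down- and upsampling with scikit-learn's `resample` on a toy imbalanced frame:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})  # 8:2 imbalance

majority = df[df["y"] == 0]
minority = df[df["y"] == 1]

# Downsampling: randomly drop majority-class records.
down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
balanced_down = pd.concat([down, minority])

# Upsampling: randomly duplicate minority-class records.
up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced_up = pd.concat([majority, up])

print(balanced_down["y"].value_counts(), balanced_up["y"].value_counts())
# For the cost-function route, many sklearn estimators accept class_weight="balanced".
```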
Handling Missing Values
- Drop rows/columns
- Impute values
- average value
- common point/ mode
- placeholder text
- advanced imputation
- numeric data: mean/median
- numeric/categorical data: mode/placeholder
- Use a trained model to predict the missing values (advanced imputation)
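A scikit-learn sketch of the basic imputation strategies on toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Numeric data: impute with the mean (median is also common and more robust to outliers).
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Mode, or a fixed placeholder value, also works for categorical data.
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)
X_const = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X)
print(X_mean, X_mode, X_const, sep="\n")

# Advanced, model-based imputation exists too, e.g. sklearn's IterativeImputer
# (requires `from sklearn.experimental import enable_iterative_imputer`).
```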