ML Model development - Coggle Diagram
- 
- ML Model development 
- 
- 
-  Training and test dataset
 
- 
- 
- 
- Test Set- 
-  used only for final evaluation and to estimate generalization error
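
A minimal sketch of the train/validation/test split described above, using only the standard library. The function name `train_val_test_split`, the 70/15/15 proportions, and the fixed seed are illustrative assumptions, not something the notes prescribe.

```python
import random


def train_val_test_split(records, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle records and cut them into train, validation, and test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = shuffled[:n_test]                # touched only once, at the very end
    val = shuffled[n_test:n_test + n_val]   # used for tuning and model selection
    train = shuffled[n_test + n_val:]       # used to fit the model
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The test list is set aside first so that no tuning decision can leak into the final generalization estimate.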
 
 
 
 
 
- 
- 
- Other techniques to achieve a good-fit model:- 
-  Early stopping
-  Regularization techniques
-  Ensemble methods
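
Of the techniques listed above, early stopping is the easiest to sketch in a few lines. The helper below is a hypothetical illustration: it assumes one validation loss is recorded per epoch, and the `patience` and `min_delta` parameters are assumed names for the usual tolerance settings.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training ran to the last epoch.
    return len(val_losses) - 1, best_epoch


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

In a real training loop the model weights from `best_epoch` would typically be restored after stopping.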
 
 
 
 
- 
- 
-  Model Evaluation
 Report performance of classifier
 
- 
- Regression Metrics
 As with most choices in ML, there is no single correct metric. Best practice is to monitor several and then choose the most meaningful one for the dataset and the business problem.
 
- 
- 
-  Mean Squared Error :check:
- Root Mean Squared Error
 
-    popular because it reports error in the same units as the target, unlike MSE, where the units are squared
 
- Mean Absolute Error
 
-    more robust to outliers than MSE, since it does not penalize large errors as heavily
 
- R squared
 
-    compares model performance against a baseline model that always predicts the mean of the target :question:
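
The four regression metrics above can be sketched in plain Python as follows. The function names and the toy values are illustrative assumptions; R squared is written in its usual form, one minus the ratio of the model's squared error to that of a mean-only baseline.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: MSE brought back to the target's units."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; undefined when y_true is constant."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred),
      mae(y_true, y_pred), r_squared(y_true, y_pred))
```

A model that always predicts the mean of `y_true` scores an R squared of 0 under this definition, and a single large residual moves MSE far more than MAE.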
 
 
 
 
- 
- 
- Based on confusion matrix:- 
-  Accuracy
 :warning: Accuracy paradox: when the positive class is a small minority, a model can reach high accuracy by always predicting the majority class
-  Precision -> exactness
-  Recall -> completeness
- F1-Score
 
-    harmonic mean of precision and recall, so it balances both
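
A small sketch of the confusion-matrix metrics above for the binary case. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than a universal rule. The example at the end uses made-up labels to illustrate the accuracy warning.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data: 5 positives among 100 records, and a "model" that always says 0.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
print(binary_metrics(y_true, always_negative))
```

The always-negative predictor looks strong on accuracy while finding no positives at all, which is why precision, recall, and F1 are reported alongside it.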
 
 
 
 
 
 
- 
- 
- Underfitting Indicators:- 
-  Low performance scores on train, val, test data.  
 
 
- 
 
- 
- Overfitting Indicators:- 
-  Model performs well on train, but not on val and/or test
 
 
- 
- Remediations 
- 
- 
-  Train and validate on different datasets
 
 
- 
 
 
 
- 
- 
-  Select features to train model
 
- 
- 2.1 EDA 
- 
- Geolocation data libraries- 
-  Folium
 
 
 Resources- 
-  Experimental Design and Analysis - Howard J. Seltman
 
 
 
- 
- Objective: MLU- 
-  Discover Patterns
-  Spot anomalies
-  Look for insights for modelling choices
 
 
- 
- Patterns- 
-  Identify feature pairs that are highly correlated with each other, remove one of each pair :red_cross:
-  Identify features that are highly correlated with the target, keep the feature :check:
 
 
- 
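- :warning: Highly correlated features- 
-  Linear/logistic regression models may perform worse when features are highly correlated with each other (e.g. rooms and sqft)
-   Decision trees are largely unaffected by this problem, since each split uses a single feature and a redundant correlated feature is simply not chosen; feature importances can still be split between the two
-  :star: Linear/logistic regression models may perform better when features are highly correlated with the target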
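
One way to spot highly correlated feature pairs during EDA is the Pearson correlation coefficient. The sketch below is illustrative only: the 0.9 threshold, the function names, and the toy `rooms`/`sqft`/`age` columns are assumptions, and Pearson correlation captures linear relationships only.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either column is constant.
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Made-up housing columns, for illustration only.
features = {
    "rooms": [2, 3, 3, 4, 5, 6],
    "sqft": [800, 1100, 1200, 1500, 1900, 2300],
    "age": [30, 5, 18, 42, 11, 25],
}
for a, b, r in highly_correlated_pairs(features):
    print(a, b, round(r, 3))
```

The same `pearson` helper can be applied to each feature against the target column to find the target-correlated features worth keeping.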
 
 
 
 
 
- 
- Anomalies:- 
-  Missing Data
-  Class Imbalances
-  Outlier Detection
 
 
 
- 
 
 
- 
- 2.2 Feature Engineering
 Look for other important features that can be extracted from the given data that- 
-  provide more insights 
-  help in better modeling
 
 
 
 
- 
- 3.1 Prepare Data 
- 
- Handle class imbalances- 
-  use classification metrics such as precision, recall, and F1 score instead of accuracy
-  downsampling: randomly remove records of the dominant class
-  upsampling: randomly duplicate records of the minority class
-  data augmentation/generation: create new records similar to those of the minority class
-  sample weights in the cost function: give higher weight to rare classes
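
The resampling and weighting options above can be sketched with the standard library. All names here are hypothetical, `label_of` is an assumed callback that extracts the class from a record, and the weight formula (total records divided by number of classes times class count) is one common inverse-frequency heuristic, not the only choice.

```python
import random
from collections import Counter


def _group_by_class(records, label_of):
    groups = {}
    for record in records:
        groups.setdefault(label_of(record), []).append(record)
    return groups


def downsample(records, label_of, seed=0):
    """Randomly drop records so every class matches the rarest class."""
    rng = random.Random(seed)
    groups = _group_by_class(records, label_of)
    target = min(len(group) for group in groups.values())
    out = []
    for group in groups.values():
        out.extend(rng.sample(group, target))
    return out


def upsample(records, label_of, seed=0):
    """Randomly duplicate records so every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(records, label_of)
    target = max(len(group) for group in groups.values())
    out = []
    for group in groups.values():
        out.extend(group)
        out.extend(rng.choices(group, k=target - len(group)))
    return out


def class_weights(labels):
    """Inverse-frequency weights: rare classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Toy records as (feature, label) pairs: 90 negatives, 10 positives.
records = [("a", 0)] * 90 + [("b", 1)] * 10
print(Counter(r[1] for r in downsample(records, lambda r: r[1])))
print(Counter(r[1] for r in upsample(records, lambda r: r[1])))
print(class_weights([r[1] for r in records]))
```

Resampling should be applied to the training split only, so that validation and test data keep the real class distribution.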
 
 
 
 
- 
- Handling Missing Values- 
-  Drop rows/columns
- Impute values  
 
-    average value
-    most common value (mode)
-    placeholder text
-    advanced imputation
 
 
 
 Numerical data: mean/median
 Numerical/categorical data: mode/placeholder
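
A minimal sketch of the simple imputation strategies listed above. It assumes missing values are represented as `None` in a plain list; the function name `impute`, the strategy strings, and the default placeholder text are illustrative choices.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean", placeholder="missing"):
    """Replace None entries in a column using the chosen strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "placeholder":
        fill = placeholder
    elif strategy not in ("mean", "median", "mode"):
        raise ValueError(f"unknown strategy: {strategy}")
    elif not observed:
        raise ValueError("cannot compute a statistic from an empty column")
    elif strategy == "mean":
        fill = mean(observed)      # numerical data
    elif strategy == "median":
        fill = median(observed)    # numerical data, less sensitive to outliers
    else:
        fill = mode(observed)      # numerical or categorical data
    return [fill if v is None else v for v in values]


print(impute([1.0, None, 3.0], "mean"))
print(impute([1.0, None, 2.0, 100.0], "median"))
print(impute(["red", None, "blue", "red"], "mode"))
print(impute(["a", None], "placeholder"))
```

To avoid leaking information, the fill value would normally be computed on the training split and then reused for validation and test data.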
 
 
 
- 
- 
-  Use the trained classifier to make predictions
 
 
- 
-