Data Reduction and Splitting
Unique Rows
Remove duplicate rows
Partial match and remove
Removal based upon duplicates in some columns
Complete match and remove
Removal based upon duplicates in all columns
Sort rows so that the row to keep comes first
Unique function
Function that keeps only unique rows, discarding rows already encountered
Summarize function
Each unique ID can be grouped into its own bucket, making it easy to count the number of occurrences
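
A minimal sketch of the ideas above, in plain Python. The names `unique_rows`, `summarize`, and the sample rows are hypothetical and only for illustration; a table is assumed to be a list of dictionaries with hashable values.

```python
from collections import Counter


def unique_rows(rows, key_columns=None):
    """Keep the first occurrence of each row and discard later duplicates.

    key_columns=None compares every column (complete match);
    a list of column names compares only those columns (partial match).
    """
    seen = set()
    kept = []
    for row in rows:
        columns = sorted(row) if key_columns is None else key_columns
        key = tuple((col, row[col]) for col in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def summarize(rows, id_column):
    """Count how many rows fall into each unique-ID bucket."""
    return Counter(row[id_column] for row in rows)


# Hypothetical sample data
rows = [
    {"id": 1, "city": "Oslo", "visits": 3},
    {"id": 1, "city": "Oslo", "visits": 3},  # duplicate in all columns
    {"id": 1, "city": "Oslo", "visits": 7},  # duplicate in id and city only
    {"id": 2, "city": "Lima", "visits": 1},
]

# Complete match: only rows identical in every column are removed.
complete = unique_rows(rows)

# Partial match: sort first so the row to keep (highest visits) is seen first.
by_visits = sorted(rows, key=lambda r: r["visits"], reverse=True)
partial = unique_rows(by_visits, key_columns=["id", "city"])

# Summarize: number of rows per ID.
counts = summarize(rows, "id")
```

Because only the first occurrence survives, the sort order decides which of the partially matching rows is kept.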
Filtering
Splitting data into two separate tables based upon characteristics of that data
Ex: separate rows containing a certain characteristic into a new table
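
Filtering into two tables could look roughly like this. The helper `split_by`, the `status` column, and the sample orders are assumptions made for the example, not part of the original notes.

```python
def split_by(rows, predicate):
    """Split rows into (matching, remaining) tables based on a condition."""
    matching, remaining = [], []
    for row in rows:
        if predicate(row):
            matching.append(row)
        else:
            remaining.append(row)
    return matching, remaining


# Hypothetical sample data
orders = [
    {"order": 101, "status": "shipped"},
    {"order": 102, "status": "refunded"},
    {"order": 103, "status": "shipped"},
    {"order": 104, "status": "refunded"},
]

# Rows with the characteristic go to a new table; the rest stay behind.
refunded, others = split_by(orders, lambda row: row["status"] == "refunded")
```

Every row lands in exactly one of the two tables, so no data is lost by the split.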
Sampling
Unbalanced datasets
One value is underrepresented relative to the other in binary targets
Holdback sample
Portion of the data set aside first and used to evaluate the model produced
Usually about 20% of the data
Steps to creating a model
Randomize data and select a holdout sample
Remaining data split into 5-10 folds
Set aside fold 1, create a model from the combined data in folds 2-5, and use fold 1 for validation
Set aside fold 2, create a model from folds 1, 3, 4, 5, and use fold 2 for validation; repeat for each remaining fold
Calculate the accuracy for each fold, and average the accuracies to get an overall accuracy
Restructure and clean the data to attempt to train a better model
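
The steps above can be sketched in plain Python as follows. This assumes 5 folds and a 20% holdback sample; `train_majority` is a hypothetical placeholder model that always predicts the most common label, standing in for whatever real model would be trained. The synthetic data is also invented for the example.

```python
import random


def holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Randomize the data and set aside a holdback sample."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]  # (remaining, holdout)


def make_folds(rows, k=5):
    """Deal the rows into k folds of near-equal size."""
    return [rows[i::k] for i in range(k)]


def train_majority(rows):
    """Placeholder model (assumption): always predict the most common label."""
    labels = [label for _, label in rows]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def cross_validate(rows, k=5):
    """Hold out each fold in turn, train on the rest, and average the accuracies."""
    folds = make_folds(rows, k)
    scores = []
    for i, validation in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_majority(training)
        scores.append(accuracy(model, validation))
    return sum(scores) / k, scores


# Hypothetical data: (features, label) pairs with an underrepresented label 1
data = [((i,), 1 if i % 4 == 0 else 0) for i in range(100)]

remaining, holdout = holdout_split(data)
mean_accuracy, fold_scores = cross_validate(remaining, k=5)

# The holdback sample is used only once, to evaluate the model produced.
final_model = train_majority(remaining)
holdout_accuracy = accuracy(final_model, holdout)
```

The holdback sample never takes part in the folds, so its accuracy is an independent check on the cross-validated average.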