Chapter 12: Data Reduction and Splitting
Major topic: Splitting Rows
Unique Rows
removing duplicate rows
Partial match removal:
removal of full rows based on identical content in a few columns
data must first be sorted so that the rows to keep are listed first
example: looking up longitude and latitude by date and time, then eliminating the extra longitude/latitude rows
done using a unique function
summarize
groups unique configurations into discrete buckets
complete match removal
for a row to be removed, all values in all columns must match the same values in a prior row
Complete match removal:
removal of full rows based on identical content in all columns
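The two kinds of duplicate removal above can be sketched in plain Python. This is a minimal illustration, not the method of any particular tool: the sample rows, the column order (date, time, latitude, longitude), and the function names are all assumptions made for the example. Both functions keep the first occurrence they see, which is why the sort order matters for partial matching.

```python
# Hypothetical GPS readings: (date, time, latitude, longitude).
# Assumed to be already sorted so the row to keep comes first.
rows = [
    ("2024-01-01", "09:00", 40.0, -105.0),
    ("2024-01-01", "09:00", 40.0, -105.0),  # identical to the first row in every column
    ("2024-01-01", "09:00", 40.1, -105.2),  # same date/time, different position
    ("2024-01-01", "10:00", 40.3, -105.4),
]


def complete_match_removal(rows):
    """Keep the first occurrence of each row whose values match in all columns."""
    seen = set()
    kept = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept


def partial_match_removal(rows, key_columns):
    """Keep the first row for each combination of values in key_columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


complete = complete_match_removal(rows)
# Match on date and time only (columns 0 and 1).
partial = partial_match_removal(rows, key_columns=(0, 1))
```

Complete match removal drops only the second row here, while partial match removal on date and time leaves one position per timestamp.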
Filtering
splits up data into separate tables based upon characteristics of that data
key in developing training for machine learning
used with data transformations that can only be applied to the subset
pairs this data with a new set of data via unions
also helps with predictions
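As a rough sketch of the filter, transform, and union pattern described above: the records, the "region" field, and the doubling transformation below are hypothetical and chosen only to show the flow.

```python
# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 75},
    {"id": 4, "region": "west", "sales": 120},
]

# Filtering: split the data into two tables based on a characteristic.
east = [r for r in records if r["region"] == "east"]
others = [r for r in records if r["region"] != "east"]

# A transformation that should only be applied to the filtered subset
# (doubling is an arbitrary stand-in for a real adjustment).
east_adjusted = [{**r, "sales": r["sales"] * 2} for r in east]

# Union: stack the transformed subset back together with the remaining rows.
combined = east_adjusted + others
```

The union restores the full row count, with only the filtered subset changed.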
Sampling
selects smaller datasets that mirror the population
used in both building and evaluating models
ensures that machine learning findings are generalizable to current and future data
preservation of processing power and time spent in analysis
helps work with unbalanced data sets
machine learning
random sampling is key
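A minimal sketch of random sampling, plus one common way to handle an unbalanced data set (downsampling the majority class). The synthetic population, the 10% positive rate, the sample sizes, and the fixed seed are assumptions for the example. Downsampling is only one of several balancing options and is not named in the notes.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1000 rows, 10% with a positive target.
population = [{"id": i, "target": 1 if i % 10 == 0 else 0} for i in range(1000)]

# Simple random sample: every row has the same chance of being selected.
simple_sample = rng.sample(population, 100)

# Balancing an unbalanced data set by downsampling the majority class.
positives = [r for r in population if r["target"] == 1]
negatives = [r for r in population if r["target"] == 0]
balanced = positives + rng.sample(negatives, len(positives))
rng.shuffle(balanced)
```

The smaller samples also cut processing time, since later steps run on 100 or 200 rows instead of 1000.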
Adjusting the Data
"unique identifier" is used to predict the target and the "target" itself
Validation, cross-validation, holdout sample creation
1.) holdout sample: randomize the data, then set the holdout sample aside
2.) ignore holdback sample and focus on the other folds
employed to predict the actual value of the target
uses binary assignments
3.) comes with cross-validation only
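The steps above can be sketched as follows: shuffle the rows, set a holdout sample aside, split the rest into folds, and rotate which fold is used for validation. The 20% holdout fraction, the five folds, and the function names are assumptions for illustration, not values taken from the notes.

```python
import random


def split_holdout_and_folds(row_ids, holdout_fraction=0.2, n_folds=5, seed=0):
    """Randomize the rows, set a holdout sample aside, and fold the rest."""
    rng = random.Random(seed)
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    holdout = shuffled[:n_holdout]
    remaining = shuffled[n_holdout:]
    # Deal the remaining rows into n_folds groups of near-equal size.
    folds = [remaining[i::n_folds] for i in range(n_folds)]
    return holdout, folds


def cross_validation_rounds(folds):
    """Yield (training, validation) pairs, using each fold once for validation."""
    for i, validation in enumerate(folds):
        training = [rid for j, fold in enumerate(folds) if j != i for rid in fold]
        yield training, validation


holdout, folds = split_holdout_and_folds(range(100))
rounds = list(cross_validation_rounds(folds))
```

The holdout rows never appear in any training or validation fold, so they stay untouched until the final evaluation.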