Chapter 12 - Data Reduction and Splitting
Unique rows
duplicate rows need to be removed from a data set
1: Partial match removal
remove full rows based on identical content of a few columns
first sort the data so that the rows to keep appear first, then specify which columns must be identical for a row to count as a duplicate and be removed
"unique function" keeps only unique rows, discards and row containing data already seen
"summarize function" each unique device id can be grouped with each unique configuration of longitude and latitude into discreet buckets
another round of this may be required after sorting by device id and count (descending) of call location, to retain only the most frequent location of each device (see the sketch below)
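A minimal pandas sketch of partial match removal, assuming a hypothetical call-log table with device_id, longitude, and latitude columns (the data and column names are illustrative, not from the source):

```python
import pandas as pd

# Hypothetical call-log data; values are illustrative only.
calls = pd.DataFrame({
    "device_id": ["A", "A", "A", "B", "B"],
    "longitude": [-71.06, -71.06, -71.10, -71.20, -71.20],
    "latitude":  [42.36, 42.36, 42.38, 42.40, 42.40],
})

# "summarize": count how often each device appears at each location bucket.
location_counts = (
    calls.groupby(["device_id", "longitude", "latitude"])
         .size()
         .reset_index(name="count")
)

# Sort so the row to keep (most frequent location per device) comes first,
# then do partial match removal on device_id only.
most_frequent = (
    location_counts.sort_values(["device_id", "count"], ascending=[True, False])
                   .drop_duplicates(subset=["device_id"], keep="first")
)
print(most_frequent)
```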
2: Complete match removal
remove full rows based on identical content of all columns
special case of partial match removal
means that for a row to be removed, all values in all columns must match the same values in a prior row (as sketched below)
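A complete match removal sketch in pandas, again with made-up data; with no subset of columns specified, every column must match a prior row for the duplicate to be dropped:

```python
import pandas as pd

rows = pd.DataFrame({
    "device_id": ["A", "A", "B"],
    "longitude": [-71.06, -71.06, -71.20],
    "latitude":  [42.36, 42.36, 42.40],
})

# Complete match removal: a row is dropped only if ALL column values
# match a row already seen (the special case of partial match removal
# where the subset is every column).
deduplicated = rows.drop_duplicates(keep="first")
print(deduplicated)
```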
Filtering
often necessary and convenient for splitting up data into two separate tables based on characteristics of the data
used for train and test files
ensures changes are made uniformly to training and test rows
non-uniform modifications will harm a model's predictive ability when it is tested
often there is a set of data transformations that apply only to a subset of available data
data can be filtered into two separate tables (same columns, diff rows)
can use a "union" to combine training and test data
"filter" to split the table
Sampling
generally used to select smaller datasets (samples) mirroring the characteristics of a population
in machine learning, sampling is used to create the datasets that build models and the datasets that evaluate them, ensuring that ML findings are generalizable to contemporary data and capable of predicting future behaviors and events
benefit: for very large datasets, sampling preserves processing power and reduces time spent in analysis (a small sketch follows)
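A minimal sampling sketch in pandas; the 10% fraction and random_state are arbitrary choices for illustration:

```python
import pandas as pd

# Pretend this is a very large dataset (illustrative data).
population = pd.DataFrame({"feature": range(1_000_000)})

# Draw a 10% random sample that should mirror the population's
# characteristics while cutting processing time.
sample = population.sample(frac=0.10, random_state=42)
print(len(sample))  # ~100,000 rows
```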
learning curves are used in ML to understand how much data is enough (see the sketch below)
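A sketch of a learning curve with scikit-learn, used to judge how much data is enough; the logistic regression model and synthetic data are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary-target data stands in for a real dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Score the model on increasing amounts of training data; when the
# validation score flattens, adding more data no longer helps much.
train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0],
)
print(train_sizes)
print(valid_scores.mean(axis=1))
```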
the data set can be unbalanced
happens when one value is underrepresented relative to the other in what are called binary targets
binary targets
two target values, such as "yes" and "no"
common in ML
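A sketch for checking and rebalancing a binary target, assuming a hypothetical target column; downsampling the majority class is one common remedy, not necessarily the one the chapter uses:

```python
import pandas as pd

# Hypothetical data with an underrepresented "yes" class.
df = pd.DataFrame({"target": ["no"] * 90 + ["yes"] * 10})
print(df["target"].value_counts())  # no: 90, yes: 10 -> unbalanced

# Downsample the majority class so both target values appear equally often.
minority = df[df["target"] == "yes"]
majority = df[df["target"] == "no"].sample(n=len(minority), random_state=0)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)
print(balanced["target"].value_counts())
```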
holdback sample
first set of data extracted; used for the final evaluation of the model(s) produced
before the holdback is released and used for that final evaluation, other samples will be used for validating models (see the holdout sketch below)
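A minimal sketch of carving out a holdback (holdout) sample with scikit-learn; the 20% size is an assumed convention, not taken from the chapter:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Randomize order and set aside a holdback sample for the FINAL
# evaluation only; everything else is used for training/validation.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=0
)
print(len(X_rest), len(X_holdout))  # 800 200
```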
STEPS for validation, cross validation, and holdout sample creation
1: randomize the order of the data and select a holdout sample
2: ignore the holdback sample for now and focus on the remaining five folds; the model is employed to predict the actual values in the target column of fold 1; once predictions are complete, the ML system compares the true target values in fold 1 to the predicted target values and calculates a variety of success metrics ("accuracy" checks how many true zeros the model predicted as 0 values and how many true ones it predicted as 1 values; if it is correct in 3/4 of cases, this will later be known as the validation score)
3: happens only if cross validation is needed and valuable for a model; the validation sample moves down to fold 2
4: validation sample is moved down one more step and is now fold 3
5: validation sample moved down to fold 4
6: validation sample moved to fold 5
7: overall accuracy for cross validation is calculated; every row in the non-holdout training set is used five times, four times to construct a model and once to evaluate another model
*: with feedback from validation/cross validation, improvements can be made to the data set (see the cross-validation sketch below)
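A hedged sketch of five-fold cross validation with scikit-learn, mirroring the steps above; the model choice and accuracy metric are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1200, random_state=0)

# Step 1: randomize and set aside the holdout sample.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=1 / 6, shuffle=True, random_state=0
)

# Steps 2-6: each of the five folds takes a turn as the validation fold.
fold_scores = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])
    preds = model.predict(X_train[valid_idx])
    fold_scores.append(accuracy_score(y_train[valid_idx], preds))

# Step 7: overall cross-validation accuracy; every non-holdout row is used
# five times (four times to build models, once to evaluate one).
print(sum(fold_scores) / len(fold_scores))
```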
Exercises
difference between partial match removal and complete match removal
why do we cross validate?
Model Data
this section is about modeling data, or using data to train algorithms to create models that can predict future events and understand past ones
this data is extracted, given a target (T), and prepared to submit to the "model data" stage; a model is then produced, examined, and placed into production
once the model is in a production system, data from the real world can be input into the system (see the sketch below)
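A minimal sketch of the model data flow described above, using scikit-learn with synthetic data and a logistic regression model as stand-ins (neither is specified by the source):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Extracted data with a target (T).
X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# "Model data" stage: produce and examine a model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))

# In production: new real-world rows are fed to the model for predictions.
new_rows = X_test[:3]  # stand-in for incoming production data
print(model.predict(new_rows))
```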
DataRobot
fulfills almost every one of the criteria for AutoML excellence
performs best on AutoML criteria
is a "living tool"; constantly updated and improved