Chapter 12 - Data Reduction & Splitting
12.1: Unique Rows
if there are duplicate rows, REMOVE the duplicates
partial match removal: removes rows that duplicate another row on a chosen subset of columns
data is sorted so the rows to keep come first; then the columns to check for duplicates are listed
complete match removal: removes rows that duplicate another row on all columns
summarize function
each unique device ID can be grouped with each unique configuration
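A minimal sketch of the two kinds of duplicate removal, in plain Python. The rows, the column names (device_id, config, reading) and the helper unique_rows are all hypothetical, chosen only for illustration. The first row seen for each combination is the one kept, which is why the sort order matters.

```python
# Hypothetical rows; the column names are illustrative, not from the notes.
rows = [
    {"device_id": "A", "config": "x", "reading": 1},
    {"device_id": "A", "config": "x", "reading": 1},  # duplicate on all columns
    {"device_id": "A", "config": "x", "reading": 2},  # duplicate on device_id + config only
    {"device_id": "B", "config": "y", "reading": 3},
]

def unique_rows(rows, columns=None):
    """Keep the first row seen for each distinct combination of `columns`.

    columns=None compares every column (complete match removal);
    a subset of columns gives partial match removal.
    """
    seen = set()
    kept = []
    for row in rows:
        keys = columns if columns is not None else sorted(row)
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            kept.append(row)
    return kept

complete = unique_rows(rows)                                   # 3 rows remain
partial = unique_rows(rows, columns=["device_id", "config"])  # 2 rows remain
```

With a partial match on device_id and config, each unique device/configuration pair appears once, which is the same grouping a summarize step would produce.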
12.2: Filtering
splitting up data into 2 separate tables based on characteristics
often there are data transformations that apply only to a subset of the data
can be separated into 2 different tables
can break up based on string type, ounces/pounds, etc.
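A small sketch of filtering on a characteristic, assuming hypothetical rows with a weight column and a unit column ("oz" or "lb"). It splits the data into 2 tables and applies a transformation to only one of them.

```python
# Hypothetical rows; "weight" and "unit" are illustrative column names.
rows = [
    {"item": "a", "weight": 16.0, "unit": "oz"},
    {"item": "b", "weight": 2.0, "unit": "lb"},
    {"item": "c", "weight": 8.0, "unit": "oz"},
]

# Filter into two separate tables based on the unit column.
ounces = [r for r in rows if r["unit"] == "oz"]
pounds = [r for r in rows if r["unit"] == "lb"]

# Apply a transformation to only one subset: convert ounces to pounds.
OUNCES_PER_POUND = 16
converted = [
    {**r, "weight": r["weight"] / OUNCES_PER_POUND, "unit": "lb"}
    for r in ounces
]
```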
12.3: Sampling
used to select smaller datasets that mirror the characteristics of the population
saves processing power and analysis time compared with working on the full population
unbalanced dataset: one value of a binary target (yes/no, etc.) is underrepresented relative to the other
the data gathered is almost always split by random sampling into several groups
first set of data extracted is HOLDBACK SAMPLE
used for the final evaluation of the model produced
usually 20% of data
evaluates the model's performance
the holdback sample is RANDOMIZED
remaining data split into FOLDS (groups) - usually 5 or 10
the right # of folds weighs the benefit of cross-validation against the processing power it needs
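The holdback-then-folds split can be sketched as follows. The function name, the fixed seed and the round-robin assignment are assumptions made for illustration; only the 20% holdback and the 5 folds come from the notes.

```python
import random

def split_holdback_and_folds(rows, holdback_fraction=0.2, n_folds=5, seed=0):
    """Randomly set aside a holdback sample, then deal the rest into folds."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_holdback = round(len(shuffled) * holdback_fraction)
    holdback = shuffled[:n_holdback]
    remaining = shuffled[n_holdback:]
    # Deal rows round-robin so fold sizes differ by at most one.
    folds = [remaining[i::n_folds] for i in range(n_folds)]
    return holdback, folds

# 25 hypothetical row ids: 5 go to the holdback sample, 20 into 5 folds of 4.
holdback, folds = split_holdback_and_folds(range(25))
```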
5-fold validation: set aside fold 1 and hide its TARGET VALUES - combine the rows in the remaining 4 folds and use them to build a model in which the columns explain the target
the model is then given the data in fold 1 so its predictions can be compared with the actual target values
if it is correct in 3 out of 4 cases when predicting 0 and 1 values, accuracy is .75 (the validation score)
the entire process repeats, now holding out fold 2, and so on
combine folds 1, 3, 4, 5
against fold 2 - right 2/4 times - accuracy of .50
against fold 3 - right 1/4 times - accuracy of .25
against fold 4 - right 3/4 times - accuracy of .75
against fold 5 - right 1/4 times - accuracy of .25
add the correct predictions across all folds: 3+2+1+3+1 = 10
10/20 correct in total - overall accuracy of .50
random guessing on a binary target would also give about .50 accuracy, so the model is no better than a blind guess!
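The arithmetic of the 5-fold example can be checked with a few lines of Python. The per-fold counts are the ones in the notes (4 rows per fold); the variable names are illustrative.

```python
# Correct predictions against each held-out fold (folds 1 to 5), 4 rows per fold.
correct_per_fold = [3, 2, 1, 3, 1]
rows_per_fold = 4

fold_accuracies = [c / rows_per_fold for c in correct_per_fold]
overall_accuracy = sum(correct_per_fold) / (rows_per_fold * len(correct_per_fold))

print(fold_accuracies)   # [0.75, 0.5, 0.25, 0.75, 0.25]
print(overall_accuracy)  # 0.5
```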