Chapter 12: Data Reduction and Splitting
12.1: Unique Rows
Remove duplicate rows
complete match removal
special case of partial match removal
removal of full rows based on identical content in all columns
For a row to be removed
all values in its columns must match the values in a prior row
if order ID values are unique, no two rows match, so no rows are removed from the example
doesn't matter which row is retained: sorting the data is not required, since the only rows removed are identical to another row that is retained
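A minimal sketch of complete match removal, assuming pandas as the tool (the notes do not name one). The table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical order lines; the first and third rows match in every column.
orders = pd.DataFrame({
    "order_id": [101, 102, 101, 103],
    "product": ["milk", "bread", "milk", "eggs"],
    "quantity": [2, 1, 2, 12],
})

# With no subset given, drop_duplicates compares all columns,
# so only rows that fully match an earlier row are removed.
unique_orders = orders.drop_duplicates()
```

No sort is needed here: whichever copy is kept, the surviving row has the same content.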
partial match removal
removal of full rows based on identical content in a few columns
data must be sorted, because the first row of each matching group is the one kept
First: sort so that the rows to keep come first
Then: specify which columns must be identical for rows to count as duplicates
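The two steps (sort, then name the matching columns) can be sketched as follows, assuming pandas. The table, its columns, and the choice to keep the latest reading per device are all hypothetical.

```python
import pandas as pd

# Hypothetical readings: two rows per device.
readings = pd.DataFrame({
    "device_id": ["a", "a", "b", "b"],
    "timestamp": [1, 3, 2, 5],
    "value": [10, 30, 20, 50],
})

# Step 1: sort so the row to keep comes first (here: the latest timestamp).
sorted_readings = readings.sort_values("timestamp", ascending=False)

# Step 2: name the columns that must match; the first row per device is kept.
latest = sorted_readings.drop_duplicates(subset=["device_id"], keep="first")
```

Unlike complete match removal, the sort order changes the result: sorting ascending instead would keep the earliest reading per device.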
EX: Business goal - find home address based on location data from cell phone
people move around a lot, so there are 100s of possible locations
analyze data, apply unique function
only keep unique rows, throw out others
keep only important times of day, relevant to desired target
Then: summarize function
each individual (unique) ID linked with its unique long/lat locations
Then apply: partial match removal again - to obtain most frequent location
Result: table contains only device ID and long/lat, labeled as home location
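A rough sketch of the home-location steps above, assuming pandas. The data, the column names, and the idea that night hours (21:00 to 06:00) are the "important times of day" are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical phone pings; the first two rows are exact duplicates.
pings = pd.DataFrame({
    "device_id": ["d1", "d1", "d1", "d1", "d1", "d1", "d2", "d2", "d2"],
    "hour": [23, 23, 2, 22, 14, 15, 1, 3, 12],
    "lat": [40.0, 40.0, 40.0, 41.5, 41.5, 41.5, 35.2, 35.2, 36.0],
    "lon": [-75.0, -75.0, -75.0, -74.0, -74.0, -74.0, -80.1, -80.1, -81.0],
})

# Unique: drop rows that match another row in every column.
deduped = pings.drop_duplicates()

# Keep only the times of day relevant to the target (assumed: night hours).
night = deduped[(deduped["hour"] >= 21) | (deduped["hour"] < 6)]

# Summarize: count pings per device and location.
counts = (
    night.groupby(["device_id", "lat", "lon"])
    .size()
    .reset_index(name="n_pings")
)

# Partial match removal: sort the most frequent location first,
# then keep one row per device.
home = (
    counts.sort_values("n_pings", ascending=False)
    .drop_duplicates(subset=["device_id"])
    .drop(columns="n_pings")
    .sort_values("device_id")
    .reset_index(drop=True)
)
```

In this made-up data, device d1 has more daytime pings at its second location, so skipping the time-of-day filter would label the wrong place as home.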
Another Example
data: products sold in the dairy product category
Goal: count # of orders containing a dairy product
could apply unique function - based on the order ID column
Result: leaves only one dairy product per order = easy count of # of dairy orders
Since the data was not sorted by product, the product left on each row is arbitrary; remove the remaining product column = no value to us
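A small sketch of the dairy count, assuming pandas. The table and column names are hypothetical, and it assumes the table also holds non-dairy rows that must be filtered out first.

```python
import pandas as pd

# Hypothetical order lines: one row per product sold.
items = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 3, 3],
    "product": ["milk", "cheese", "bread", "apples", "yogurt", "butter"],
    "category": ["dairy", "dairy", "bakery", "produce", "dairy", "dairy"],
})

# Keep only dairy rows, then keep one row per order ID.
dairy = items[items["category"] == "dairy"]
dairy_orders = dairy.drop_duplicates(subset=["order_id"])

# The surviving product is arbitrary (no sort was applied), so drop the column.
dairy_orders = dairy_orders.drop(columns="product")

n_dairy_orders = len(dairy_orders)
```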
Removing rows from a data set or splitting it into different parts
12.2: Filtering
outside of test and train files
often: a data transformation applies only to a subset of the available data
the data is then filtered into 2 separate tables
EX: imputation of missing data, when necessary
use union to combine training and test data
finding age: filter people with a known age into a new training set, then calculate the mean age
use the mean as the value for all the rows in the test dataset
(the rows that were initially missing a value in the age column)
Or: train an ML model to predict passenger age from all other features (not including the target feature), then apply it to the rows missing an age
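The mean-age imputation can be sketched as below, assuming pandas. The passenger tables and column names are hypothetical. This shows the simple mean approach, not the ML-based alternative.

```python
import pandas as pd

# Hypothetical passenger tables with some ages missing.
train = pd.DataFrame({"passenger_id": [1, 2, 3], "age": [20.0, None, 40.0]})
test = pd.DataFrame({"passenger_id": [4, 5], "age": [None, 30.0]})

# Union: stack training and test rows (same columns, different rows).
combined = pd.concat([train, test], ignore_index=True)

# Filter into two tables: rows with a known age and rows missing it.
has_age = combined[combined["age"].notna()]
missing_age = combined[combined["age"].isna()]

# Compute the mean from the known ages and assign it to the missing rows.
mean_age = has_age["age"].mean()
missing_age = missing_age.assign(age=mean_age)

# Union the two tables again and restore the original row order.
imputed = pd.concat([has_age, missing_age]).sort_index()
```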
same columns, different rows
splitting up a set of data into 2 separate tables
Based on: characteristics of that data
Example: important to union training and test data - makes sure changes are made uniformly to training and test rows
Why? non-uniform modification of the test and training sets will harm the model's predictive ability when it is tested
when data modifications (feature engineering) are done, separate the training and test rows
only the training set can be used to train ML models
could be: move rows with a value in the target column to the train dataset
those with no such value are moved to the test set
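A sketch of the union, modify, then split workflow described above, assuming pandas. The tables, the `survived` target column, and the `fare_high` feature are hypothetical.

```python
import pandas as pd

# Hypothetical data: only the training rows have a target value.
train = pd.DataFrame({"fare": [10.0, 20.0], "survived": [0, 1]})
test = pd.DataFrame({"fare": [30.0]})

# Union: test rows get a missing value in the target column.
combined = pd.concat([train, test], ignore_index=True)

# Feature engineering is applied once, uniformly, to all rows.
combined["fare_high"] = combined["fare"] > 15

# Filter: rows with a target value go to train, the rest go to test.
is_train = combined["survived"].notna()
train_out = combined[is_train]
test_out = combined[~is_train].drop(columns="survived")
```

Splitting on the target column works only if every training row has a target value; otherwise a separate marker column added before the union would be a safer way to tell the rows apart.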