Ch. 12: Data Reduction and Splitting
Splitting rows
Different reasons/approaches for splitting data
Unique Rows
Two ways to remove duplicates
Partial Match Removal
Remove full rows based on identical content of a few columns
Data must first be sorted so that the rows to keep come first
Then specify which columns must be identical for a row to be removed as a duplicate
Ex: Finding home address
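A minimal sketch of partial match removal in plain Python, assuming rows are dictionaries. The helper name `unique_on`, the column names, and the customer records are all hypothetical and only illustrate the sort-then-deduplicate order described above; they are not taken from the chapter.

```python
from operator import itemgetter

def unique_on(rows, key_columns, sort_key, reverse=False):
    """Sort rows, then keep the first row seen for each combination of key_columns."""
    ordered = sorted(rows, key=sort_key, reverse=reverse)
    seen = set()
    kept, dropped = [], []
    for row in ordered:
        key = tuple(row[c] for c in key_columns)  # only these columns must match
        if key in seen:
            dropped.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, dropped

# Hypothetical records: keep the most recently updated address per customer.
records = [
    {"customer": "A", "address": "1 Elm St", "updated": "2020-01-05"},
    {"customer": "B", "address": "9 Oak Ave", "updated": "2021-03-10"},
    {"customer": "A", "address": "4 Pine Rd", "updated": "2022-07-19"},
]
kept, dropped = unique_on(records, ["customer"], itemgetter("updated"), reverse=True)
```

Sorting newest-first means the row that survives for each customer is the latest one; changing the sort order changes which duplicate is kept.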
Use summarize function to group data into buckets
Ex: Dairy Products
Removing ProductID and ProductName
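Grouping into buckets can be sketched as follows, assuming the identifying columns (ProductID, ProductName) are left out so that rows collapse into one bucket per remaining value. The `Category` and `Quantity` columns, the sample rows, and the `summarize` helper are hypothetical stand-ins, not the tool's actual interface.

```python
from collections import defaultdict

# Hypothetical dairy rows; only ProductID and ProductName come from the notes.
products = [
    {"ProductID": 1, "ProductName": "Whole Milk", "Category": "Milk", "Quantity": 10},
    {"ProductID": 2, "ProductName": "Skim Milk", "Category": "Milk", "Quantity": 5},
    {"ProductID": 3, "ProductName": "Cheddar", "Category": "Cheese", "Quantity": 7},
]

def summarize(rows, group_by, sum_column):
    """Group rows by the group_by columns and total sum_column per bucket."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in group_by)
        totals[key] += row[sum_column]
    # Rebuild one output row per bucket; ungrouped columns are dropped.
    return [dict(zip(group_by, key), **{sum_column: total})
            for key, total in totals.items()]

buckets = summarize(products, ["Category"], "Quantity")
```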
Complete Match Removal
Remove full rows based on identical content in all columns
Complete Match Removal is a special case of Partial Match Removal in which every column is compared
For a row to be removed, all values in all columns must match the same values in a prior row
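Complete match removal can be sketched the same way, with every column forming the comparison key. This version, with a hypothetical helper `drop_exact_duplicates` and made-up rows, compares each row against all earlier rows, which is an assumption about how "prior row" should be read.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row; a row is dropped only if every column matches."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(sorted(row.items()))  # all columns, independent of key order
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Hypothetical rows: the second is an exact copy, the third differs in one column.
rows = [
    {"product": "Milk", "unit": "L"},
    {"product": "Milk", "unit": "L"},
    {"product": "Milk", "unit": "gal"},
]
unique = drop_exact_duplicates(rows)
```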
Filtering
Splits a data set into two separate tables based on row characteristics
Ex: Splitting dairy quantity & measurement
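A filter that produces two tables can be sketched as a predicate-based split. The helper `split_rows`, the column names, and the sample rows are hypothetical; the condition (measurement unit equals "L") is just one plausible characteristic to split on.

```python
def split_rows(rows, predicate):
    """Return (matched, unmatched): rows where predicate is true, and all the rest."""
    matched, unmatched = [], []
    for row in rows:
        (matched if predicate(row) else unmatched).append(row)
    return matched, unmatched

# Hypothetical dairy rows with a quantity and a measurement unit.
items = [
    {"product": "Milk", "quantity": 2, "measurement": "L"},
    {"product": "Butter", "quantity": 500, "measurement": "g"},
    {"product": "Cream", "quantity": 1, "measurement": "L"},
]
liters, others = split_rows(items, lambda r: r["measurement"] == "L")
```

Every input row lands in exactly one of the two outputs, so no data is lost by the split.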
Tools
Feature Engineering: Data modifications
Unique Function: keeps only unique rows, discarding rows containing data already encountered
Summarize Function: groups data into discrete buckets