Chapter 12: Data Reduction and Splitting
Removing rows from the dataset or splitting the dataset into different parts
Splitting rows, rather than splitting columns
12.2: Filtering
Tool used for splitting up a set of data into two separate tables based on characteristics of that data
Common uses
Test and train files
Make sure changes are made uniformly to training and test rows
In order to retain a model’s predictive ability
When a set of data transformations applies only to a subset of the available data
Imputation of missing data
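A minimal sketch of the imputation use case: filter splits the rows into those that need the transformation and those that do not, the transformation is applied to one part only, and the parts are recombined. The column names (`id`, `age`) and the choice of mean imputation are assumptions made for illustration, not taken from the chapter.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 50},
    {"id": 4, "age": None},
]

# Filter: one table for rows that need imputation, one for rows that do not.
missing = [r for r in rows if r["age"] is None]
complete = [r for r in rows if r["age"] is not None]

# The fill value is computed from the complete rows only.
fill_value = mean(r["age"] for r in complete)

# Apply the transformation to the filtered subset, leaving the other rows untouched.
imputed = [{**r, "age": fill_value} for r in missing]

# Recombine the two tables and restore the original order.
recombined = sorted(complete + imputed, key=lambda r: r["id"])
```

The same pattern covers test and train files: compute values such as `fill_value` once, then apply the identical change to both sets of rows.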
Example: Northwind Dataset
Filtered based on whether the QuantityPerUnit column contains the string “kg”
However, this criterion is flawed, as it also matches the “kg” contained in “pkgs”
Can now split up based on whether the column contains a “g” (grams) measurement
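The flaw in the “kg” criterion can be reproduced in a short sketch. The sample rows and the `split_rows` helper are hypothetical, and the word-boundary regular expression is one possible stricter criterion, not necessarily the one the chapter's tool uses.

```python
import re

# Hypothetical sample rows modelled on a QuantityPerUnit column.
rows = [
    {"ProductID": 1, "QuantityPerUnit": "1 kg pkg."},
    {"ProductID": 2, "QuantityPerUnit": "12 - 1 lb pkgs."},
    {"ProductID": 3, "QuantityPerUnit": "10 - 500 g pkgs."},
    {"ProductID": 4, "QuantityPerUnit": "24 - 12 oz bottles"},
]

def split_rows(rows, predicate):
    """Return (matching, non_matching) so that every row lands in exactly one table."""
    matching = [r for r in rows if predicate(r)]
    non_matching = [r for r in rows if not predicate(r)]
    return matching, non_matching

# Naive criterion: a plain substring test, which also fires on "pkgs".
naive_kg, _ = split_rows(rows, lambda r: "kg" in r["QuantityPerUnit"])

# Stricter criterion: "kg" must stand alone as a word.
KG_UNIT = re.compile(r"\bkg\b")
strict_kg, strict_rest = split_rows(
    rows, lambda r: KG_UNIT.search(r["QuantityPerUnit"]) is not None
)

# The same idea for a standalone "g" (grams) measurement.
G_UNIT = re.compile(r"\bg\b")
gram_rows, _ = split_rows(
    rows, lambda r: G_UNIT.search(r["QuantityPerUnit"]) is not None
)

naive_ids = [r["ProductID"] for r in naive_kg]
strict_ids = [r["ProductID"] for r in strict_kg]
gram_ids = [r["ProductID"] for r in gram_rows]
```

Here the naive test selects products 1, 2 and 3, while the word-boundary test selects only product 1.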
12.1: Unique Rows
Not uncommon for data to contain duplicate rows
Two ways to remove rows
Partial match removal
Removal of full rows based on identical content of a few columns
Data must first be sorted
Sort so that the rows to KEEP are listed first
Specify which columns should be identical (for appropriate duplicates to be deleted)
Complete match removal
Removal of full rows based on identical content in all columns
Special case of partial match removal
All values in all columns must match the same values in a prior row in order for that row to be removed
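Both kinds of removal can be sketched with one function. The `drop_duplicates` helper and the customer records are hypothetical. In this sketch the sort matters because the first row encountered for each key is the one that survives.

```python
def drop_duplicates(rows, key_columns=None):
    """Keep the first row seen for each combination of key_columns.

    With key_columns=None every column is compared, which is
    complete match removal (the special case of partial match removal).
    """
    seen = set()
    kept = []
    for row in rows:
        columns = key_columns if key_columns is not None else sorted(row)
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Hypothetical customer records; the last two rows are identical in every column.
records = [
    {"email": "a@example.com", "updated": "2023-01-05", "city": "Oslo"},
    {"email": "a@example.com", "updated": "2023-03-10", "city": "Bergen"},
    {"email": "b@example.com", "updated": "2023-02-01", "city": "Lima"},
    {"email": "b@example.com", "updated": "2023-02-01", "city": "Lima"},
]

# Partial match: sort so the row to KEEP (here, the most recent) comes first,
# then treat rows as duplicates when the email column alone is identical.
newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
partial = drop_duplicates(newest_first, key_columns=["email"])

# Complete match: a row is removed only if all of its values match a prior row.
complete = drop_duplicates(records)
```

Partial match removal leaves one row per email (Bergen and Lima), while complete match removal drops only the fourth record.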
Partial Match Removal - Example
App data suggesting hundreds of locations for each user (one cell phone number for each user)
First interaction is assumed to be home interaction
Sort data by date and time (want the oldest row)
Select the unique function
Keeps only unique rows, discarding any row containing data already encountered
Retain only the first row from each day
Leaves us with one record from each day
Select the summarize function
Can count the number of occurrences of each location for each DeviceID; the location with the greatest number of occurrences is most likely that user’s actual home
May have to use partial match removal to retain only the most frequent location
Resulting table containing DeviceID, longitude, and latitude can be labelled as home location
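The home-location steps can be sketched end to end. The ping data and its layout are hypothetical, and the sketch assumes coordinates repeat exactly; real coordinates would likely need rounding or clustering before they can be counted.

```python
from collections import Counter

# Hypothetical pings: (DeviceID, ISO timestamp, longitude, latitude).
pings = [
    ("dev1", "2023-05-01T07:10", -93.62, 42.03),
    ("dev1", "2023-05-01T12:30", -93.65, 42.01),
    ("dev1", "2023-05-02T06:55", -93.62, 42.03),
    ("dev1", "2023-05-03T08:20", -93.70, 41.99),
    ("dev2", "2023-05-01T09:00", -87.63, 41.88),
]

# Step 1: sort by device, then date and time, so the oldest row of each day comes first.
pings_sorted = sorted(pings, key=lambda p: (p[0], p[1]))

# Step 2: partial match removal on (DeviceID, date), keeping the first row of each day.
seen = set()
first_of_day = []
for device, timestamp, lon, lat in pings_sorted:
    key = (device, timestamp[:10])  # first 10 characters are the YYYY-MM-DD date
    if key not in seen:
        seen.add(key)
        first_of_day.append((device, lon, lat))

# Step 3: summarize by counting how often each location occurs for each device.
counts = Counter(first_of_day)

# Step 4: keep only the most frequent location per device.
# most_common() yields entries from highest to lowest count.
home = {}
for (device, lon, lat), _count in counts.most_common():
    home.setdefault(device, (lon, lat))
```

For the sample data, `home` maps `dev1` to the location seen first on two of its three days and `dev2` to its only location.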