Chapter 12: Data Reduction and Splitting
Unique Rows
Many data sets contain duplicate rows
Partial match removal
Removal of full rows based on identical content in a few columns
Data must first be sorted so that the rows to keep come first, followed by a specification of which columns must be identical for duplicates to be removed
Leaves us with one record from each day at a time
Summarize Function: each unique id can be grouped with each unique configuration of longitude and latitude into discrete buckets
The most frequent data is often put first
Unique Function: Based on ID row
very complex process
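The sort-then-deduplicate steps above can be sketched in pandas. This is a minimal illustration, not the tool the notes refer to: the table, the column names (`id`, `day`, `reading`), and the rule "keep the largest reading" are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical records: several readings per id per day.
records = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "day": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01"],
    "reading": [5, 9, 7, 3, 4],
})

# Step 1: sort so the row to keep comes first within each (id, day) group.
# Here the assumed rule is "keep the largest reading".
ordered = records.sort_values(
    ["id", "day", "reading"], ascending=[True, True, False]
)

# Step 2: name the columns that must match for rows to count as duplicates,
# and keep only the first row of each group.
deduped = ordered.drop_duplicates(subset=["id", "day"], keep="first")
deduped = deduped.reset_index(drop=True)

print(deduped)
```

The sort order decides which row survives, so changing `ascending` to keep the smallest reading changes the result without touching the deduplication step.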
Complete match removal
Not often needed because rows aren't usually identical
Removal of full rows based on identical content in all columns
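For comparison, complete match removal can be sketched the same way in pandas, with no column subset given. The table and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical table: the first two rows are identical in every column,
# the last two differ only in "score".
rows = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "score": [10, 10, 7, 8],
})

# With no subset, a row is dropped only when every column
# matches an earlier row.
unique_rows = rows.drop_duplicates().reset_index(drop=True)

print(unique_rows)
```

Only the exact repeat is removed; the two Rome rows both survive because their scores differ.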
Filtering
Often a necessary and convenient tool for splitting a set of data into two separate tables based on characteristics of that data
Important to union train and test data in order to make sure that changes are made uniformly to training and test rows
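One way to picture the union-then-split pattern, sketched in pandas under assumed data: the `embarked` and `survived` columns, the `source` tag, and the choice of one-hot encoding as the shared transformation are all hypothetical.

```python
import pandas as pd

# Hypothetical train and test tables; test has no target column
# and contains a category ("Q") that train never sees.
train = pd.DataFrame({"embarked": ["S", "C"], "survived": [0, 1]})
test = pd.DataFrame({"embarked": ["Q", "S"]})

# Tag each row with its origin, then union the two tables.
combined = pd.concat(
    [train.assign(source="train"), test.assign(source="test")],
    ignore_index=True,
)

# Apply the transformation once, so both sets get the same columns.
combined = pd.get_dummies(combined, columns=["embarked"])

# Filter on the tag to split the rows back into two tables.
train_out = combined[combined["source"] == "train"].drop(columns="source")
test_out = combined[combined["source"] == "test"].drop(
    columns=["source", "survived"]
)

print(train_out.columns.tolist())
print(test_out.columns.tolist())
```

Encoding the two tables separately would give them different column sets here, which is the kind of mismatch the union is meant to avoid.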
Uses outside just test and train files
Often there is a set of data transformations that apply only to a subset of the available data
Useful during imputation of missing data
Can train a machine learning model to predict the age of passengers based on all other features and then apply that model to the test set of passengers without age values
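The imputation idea above can be sketched as a filter, a model fit, and a union. Everything specific here is an assumption for illustration: the passenger table, the `fare` feature, and the use of a plain least-squares line as a stand-in for whatever machine learning model is actually trained.

```python
import numpy as np
import pandas as pd

# Hypothetical passenger table; two passengers have no recorded age.
passengers = pd.DataFrame({
    "passenger_id": [1, 2, 3, 4, 5, 6],
    "fare": [10.0, 20.0, 30.0, 40.0, 25.0, 35.0],
    "age": [20.0, 30.0, 40.0, 50.0, np.nan, np.nan],
})

# Filter: split the table in two based on whether age is present.
has_age = passengers["age"].notna()
known = passengers[has_age].copy()
unknown = passengers[~has_age].copy()

# Fit a least-squares line age ~ fare on the rows with a known age
# (a simple stand-in for a real predictive model).
design = np.column_stack([np.ones(len(known)), known["fare"].to_numpy()])
coef, *_ = np.linalg.lstsq(design, known["age"].to_numpy(), rcond=None)

# Apply the fitted line only to the rows that lack an age.
unknown["age"] = coef[0] + coef[1] * unknown["fare"]

# Union the two tables back together.
imputed = (
    pd.concat([known, unknown])
    .sort_values("passenger_id")
    .reset_index(drop=True)
)

print(imputed)
```

The transformation touches only the filtered subset, and the rows with a recorded age pass through unchanged.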