Chapter 12: Data Reduction and Splitting
12.1 Unique Rows
data can contain duplicate rows, which should be removed
partial match removal
removal of full rows based on identical content in a few columns
ex: a business wants to determine someone's home address based on location data from their cell phone. the org's app reports the device ID, date, time, longitude, and latitude of the cell phone each time the app is used. the app data suggests hundreds of locations for users' homes because they move around. assume each user's first interaction with the app each day occurs in their home
to find an individual's home under this assumption, sort the data by device ID and then by date and time. this lists the app usage reports from beginning to end
applying a unique function based on only the device ID and date columns retains the first row from each day. this leaves one record per day, captured at a time when the app user was likely to be home
unique function: keeps only unique rows, discarding any row containing data already encountered
it is still not known with certainty which location represents the home location
using a summarize function, each unique device ID can be grouped with each unique combination of longitude and latitude into discrete buckets. the number of occurrences of each location can then easily be counted
with enough days of data and a few assumptions, the most frequent location of the cell phone is likely to be where the person lives.
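a minimal pandas sketch of this workflow, assuming illustrative column names (device_id, timestamp, lat, lon) and made-up sample values:

```python
import pandas as pd

# hypothetical app-usage log: one row per app interaction
usage = pd.DataFrame({
    "device_id": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime([
        "2023-01-01 07:10", "2023-01-01 12:30", "2023-01-02 06:55",
        "2023-01-01 08:00", "2023-01-02 07:45",
    ]),
    "lat": [40.71, 40.75, 40.71, 34.05, 34.05],
    "lon": [-74.00, -73.98, -74.00, -118.24, -118.24],
})

# sort by device and time so the first row per day is each day's earliest use
usage = usage.sort_values(["device_id", "timestamp"])
usage["date"] = usage["timestamp"].dt.date

# "unique" step: keep only the first interaction per device per day
first_per_day = usage.drop_duplicates(subset=["device_id", "date"], keep="first")

# "summarize" step: count how often each (device, lat, lon) bucket occurs;
# in practice lat/lon would be rounded into buckets before grouping
counts = (first_per_day
          .groupby(["device_id", "lat", "lon"])
          .size()
          .reset_index(name="days_seen"))

# the most frequent location per device is the likely home
likely_home = (counts.sort_values("days_seen", ascending=False)
                     .drop_duplicates("device_id"))
print(likely_home)
```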
goal: count the number of orders containing a dairy product
apply a unique function based on the OrderID column
this leaves one dairy product per order, enabling an easy count of the number of dairy orders. the remaining products are of no value and should be removed; the data does not need to be sorted by Product ID or Product Name because it does not matter which dairy row is retained for each order
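a short pandas sketch of the dairy count; the Category column and the sample order lines are assumptions for illustration, and the dairy filter here stands in for however the data was earlier limited to dairy products:

```python
import pandas as pd

# hypothetical order lines standing in for table 12.1
orders = pd.DataFrame({
    "OrderID":     [1, 1, 2, 2, 3],
    "ProductName": ["Milk", "Bread", "Cheese", "Yogurt", "Apples"],
    "Category":    ["Dairy", "Bakery", "Dairy", "Dairy", "Produce"],
})

# keep only dairy lines, then apply the unique step on OrderID so each
# order contributes at most one dairy row; no sorting is needed because
# it does not matter which dairy product is retained per order
dairy_orders = (orders[orders["Category"] == "Dairy"]
                .drop_duplicates(subset="OrderID"))

print(len(dairy_orders))  # number of orders containing at least one dairy product
```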
complete match removal
removal of full rows based on identical content in all columns
special case of partial match removal
for a row to be removed, all values in all columns must match the same values in a prior row
no rows in table 12.1 would be removed, due to the unique OrderID values
in this case, sorting the data is not required because any row that is removed will be identical to another row that is retained; it doesn't matter which of the identical rows is kept
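for contrast, a complete-match removal can be sketched as a drop_duplicates call with no column subset; the sample rows are made up:

```python
import pandas as pd

rows = pd.DataFrame({
    "OrderID":     [1, 1, 2],
    "ProductName": ["Milk", "Milk", "Milk"],
})

# with no subset given, drop_duplicates compares every column, so a row is
# dropped only when it is identical to an earlier row in all columns
deduped = rows.drop_duplicates()
print(deduped)  # the duplicate (1, "Milk") row is removed
```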
12.2 Filtering
filtering: a necessary and convenient tool for splitting a data set into two tables based on characteristics of the data
ex: as suggested earlier, it is important to union the training and test data in order to make sure changes are made uniformly to training and test rows
reason for this: non-uniform modifications to the training and test data sets will harm a model's predictive ability when it is used
once all data modifications are concluded, the training and test rows must be separated again. only the training set can be used to train machine learning models
filtering also has uses outside the training and test files
often a set of data transformations applies only to a subset of the available data. the data can be filtered into two separate tables (same columns, different rows), as in the sketch below
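a rough pandas sketch of this union, transform, and filter-back pattern; the is_train flag and the scaling step are illustrative assumptions, not the book's exact method:

```python
import pandas as pd

# hypothetical train and test frames with the same columns
train = pd.DataFrame({"x": [1, 2], "y": [0, 1]})
test  = pd.DataFrame({"x": [3, 4], "y": [None, None]})

# tag each row with its origin, union the two tables, and transform uniformly
train["is_train"] = True
test["is_train"] = False
combined = pd.concat([train, test], ignore_index=True)
combined["x_scaled"] = combined["x"] / combined["x"].max()  # example uniform change

# filter back into two tables (same columns, different rows)
train_out = combined[combined["is_train"]].drop(columns="is_train")
test_out  = combined[~combined["is_train"]].drop(columns="is_train")
```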
ex: imputation of missing data - look at Kaggle Titanic Dataset
both the training and test datasets are missing the Age value for a number of passengers
after using a Union to combine both training and test data, a filter can be used to place those passengers with an age indicated in the data into a new training set
calculate the mean age of these people and use this value as the Age for all rows in the test dataset (the rows initially missing a value in the Age column)
results of the filter are shown in tables 12.3, 12.4, and 12.5
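a minimal sketch of the mean-age imputation, assuming a simplified Titanic-style table with only PassengerId and Age:

```python
import pandas as pd

# simplified passenger table; None marks a missing Age
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age":         [22.0, None, 35.0, None],
})

# filter into two tables: rows with a known Age and rows missing it
has_age     = passengers[passengers["Age"].notna()]
missing_age = passengers[passengers["Age"].isna()].copy()

# compute the mean age from the known rows and use it for the missing rows
mean_age = has_age["Age"].mean()
missing_age["Age"] = mean_age

imputed = pd.concat([has_age, missing_age]).sort_values("PassengerId")
print(imputed)
```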
more advanced: train a machine learning model to predict the age of passengers based on all other features
apply that model to the test set of passengers without Age values
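a sketch of the model-based alternative; the feature columns and the choice of RandomForestRegressor are assumptions for illustration, since the notes only say a model is trained on all other features:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# assumed numeric features; real Titanic data would first need categorical
# columns (Sex, Embarked, etc.) encoded as numbers
passengers = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Fare":   [71.3, 7.9, 13.0, 8.1, 53.1, 21.0],
    "SibSp":  [1, 0, 0, 1, 1, 0],
    "Age":    [38.0, 22.0, None, 26.0, None, 30.0],
})

features = ["Pclass", "Fare", "SibSp"]
known   = passengers[passengers["Age"].notna()]
unknown = passengers[passengers["Age"].isna()].copy()

# train a regressor on passengers whose Age is known...
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(known[features], known["Age"])

# ...and predict Age for the passengers missing it
unknown["Age"] = model.predict(unknown[features])
```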