Chapter 12: Data Reduction and Splitting
Splitting Rows
When data contains duplicate rows, duplicates should be removed
Partial match removal: Removal of full rows based on identical content of a few columns
Complete match removal: Removal of full rows based on identical content in all columns.
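A minimal sketch of the two removal modes, in plain Python rather than the tool the chapter uses. The rows, column names, and the `remove_duplicates` helper are all hypothetical and only illustrate the idea.

```python
# Hypothetical toy rows; the column names are illustrative only.
rows = [
    {"device_id": "a", "date": "2024-01-01", "latitude": 40.0},
    {"device_id": "a", "date": "2024-01-01", "latitude": 40.0},
    {"device_id": "a", "date": "2024-01-01", "latitude": 41.5},
    {"device_id": "b", "date": "2024-01-02", "latitude": 39.9},
]


def remove_duplicates(rows, columns=None):
    """Keep the first row for each distinct combination of `columns`.

    columns=None compares every column (complete match removal).
    A list of column names compares only those (partial match removal).
    """
    seen = set()
    kept = []
    for row in rows:
        key_columns = columns if columns is not None else sorted(row)
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)  # the full row is kept, not just the key columns
    return kept


complete = remove_duplicates(rows)
partial = remove_duplicates(rows, columns=["device_id", "date"])
```

Complete match removal drops only the exact repeat, while partial match removal on `device_id` and `date` also drops the third row, even though its latitude differs.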
determine someone’s home address
based on location data from their cellphone
date, time, longitude, and latitude of the cell
phone every time the app is used
sort the data by device id and then by date and time
select only the device id and date columns
use the unique function, retaining only the first row from each day
Using the summarize function, each unique device id can be
grouped with each unique configuration of longitude and latitude into discrete
buckets
the most frequent location of the call is likely to be where a person lives
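The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the chapter's own workflow: the ping records, the field names, and the zero-padded text format for dates and times are all hypothetical.

```python
from collections import Counter

# Hypothetical app pings: one record every time the app is used.
pings = [
    {"device_id": "a", "date": "2024-01-01", "time": "23:10", "longitude": -105.10, "latitude": 40.10},
    {"device_id": "a", "date": "2024-01-01", "time": "06:05", "longitude": -105.27, "latitude": 40.01},
    {"device_id": "a", "date": "2024-01-02", "time": "06:30", "longitude": -105.27, "latitude": 40.01},
    {"device_id": "a", "date": "2024-01-03", "time": "05:00", "longitude": -105.10, "latitude": 40.10},
    {"device_id": "b", "date": "2024-01-01", "time": "08:00", "longitude": -104.99, "latitude": 39.74},
    {"device_id": "b", "date": "2024-01-02", "time": "08:15", "longitude": -104.99, "latitude": 39.74},
]

# Step 1: sort by device id, then by date and time.
# Zero-padded strings sort chronologically, which this sketch relies on.
pings_sorted = sorted(pings, key=lambda p: (p["device_id"], p["date"], p["time"]))

# Step 2: keep only the first row per device per day
# (a partial match on device id and date).
first_of_day = {}
for p in pings_sorted:
    first_of_day.setdefault((p["device_id"], p["date"]), p)

# Step 3: count the days each device spent in each longitude/latitude bucket.
buckets = Counter(
    (p["device_id"], p["longitude"], p["latitude"]) for p in first_of_day.values()
)

# Step 4: the most frequent bucket per device is the home candidate.
# most_common() yields buckets in descending count order, so the first
# bucket seen for a device is its most frequent one.
likely_home = {}
for (device_id, lon, lat), n_days in buckets.most_common():
    likely_home.setdefault(device_id, (lon, lat))
```

Real coordinates rarely repeat exactly, so a practical version would probably round longitude and latitude before counting; the rounding precision would be a further assumption.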
Filtering
Filtering is an often necessary and convenient tool for splitting up a set of data into
two separate tables based on characteristics of that data
After using a Union to combine both training and test data, a filter can be used to
place those passengers with an age indicated in the data into a new training set
calculate the mean age for these people and then use this value as the
age value for all the rows in the test dataset
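A small sketch of the union, filter, and mean-fill sequence described above, written in plain Python. The passenger records and field names are hypothetical, and a missing age is represented as `None`.

```python
from statistics import mean

# Hypothetical passenger tables; None marks a missing age.
train = [
    {"passenger_id": 1, "age": 22.0},
    {"passenger_id": 2, "age": None},
    {"passenger_id": 3, "age": 38.0},
]
test = [
    {"passenger_id": 4, "age": 30.0},
    {"passenger_id": 5, "age": None},
]

# Union: stack both tables into one.
combined = train + test

# Filter: split the combined table on whether an age is present.
age_known = [row for row in combined if row["age"] is not None]
age_missing = [row for row in combined if row["age"] is None]

# Mean age of the rows that have one, used as the age for the rows that do not.
mean_age = mean(row["age"] for row in age_known)
age_filled = [{**row, "age": mean_age} for row in age_missing]
```

The filter yields two tables that together still contain every original row, which is what makes it a splitting tool rather than a deletion tool.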
Modeling Data
using our data to train the
algorithms to create models used to predict future events or understand past
events
DataRobot is a “living” tool that is
constantly updated and improved
Sampling
used both to create datasets that will be used to
build models and datasets used to evaluate models for the purpose of ensuring that
machine learning findings are generalizable to contemporary data and capable of
predicting future behaviors and events
preservation
of processing power and time spent in analysis
an unbalanced dataset occurs when one value
is underrepresented relative to the other in what are called binary targets
filtering tool to create two tables, one for each class of target, before downsampling
the majority class
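One way to sketch downsampling of the majority class in plain Python. The rows, the 2-to-8 class split, and the choice to downsample to an exact 1:1 ratio are assumptions made for illustration; the source does not state a target ratio.

```python
import random

# Hypothetical binary-target data: 2 positive rows, 8 negative rows.
rows = [{"row_id": i, "target": 1 if i < 2 else 0} for i in range(10)]

# Filter into two tables, one for each class of the target.
positives = [row for row in rows if row["target"] == 1]
negatives = [row for row in rows if row["target"] == 0]

# The smaller table is the minority class, the larger the majority class.
minority, majority = sorted([positives, negatives], key=len)

# Downsample the majority class without replacement to the minority size
# (a 1:1 ratio is assumed here). A fixed seed keeps the sketch repeatable.
rng = random.Random(0)
majority_down = rng.sample(majority, k=len(minority))

# Recombine and shuffle so the classes are not grouped together.
balanced = minority + majority_down
rng.shuffle(balanced)
```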
Holdback Sample: The first set of data that is extracted
used for the final evaluation of the model
Randomize the order of the data and select a holdout sample
First, set aside Fold 1 (illustrated in yellow), combine the rows in the
remaining four folds (2–5), and use these rows to create a model of which
features (columns) drive (explain) the target
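The holdout and fold procedure can be sketched as follows. Five folds follow from the text; the 100 row ids, the 20% holdout size, and the rotation of the validation fold through all five folds (standard cross-validation) are assumptions, since the notes break off after the first fold.

```python
import random

# Hypothetical dataset represented by 100 row ids.
rng = random.Random(42)
row_ids = list(range(100))

# Randomize the order of the data, then select a holdout sample.
rng.shuffle(row_ids)
holdout_size = 20  # assumed 20% holdout; the source gives no figure
holdout = row_ids[:holdout_size]
remaining = row_ids[holdout_size:]

# Deal the remaining rows into five folds of equal size.
n_folds = 5
folds = [remaining[i::n_folds] for i in range(n_folds)]

# For each fold in turn: set it aside for validation and combine
# the other four folds into the rows used to build the model.
splits = []
for i in range(n_folds):
    validation = folds[i]
    training = [r for j, fold in enumerate(folds) if j != i for r in fold]
    splits.append((training, validation))
```

The first entry in `splits` corresponds to the step described above: Fold 1 set aside and folds 2 to 5 combined. The holdout rows never appear in any fold, so they stay untouched until the final evaluation.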
test and train files