Chapter 12. Data reduction and splitting
12.1 Unique rows
it is uncommon for data to have duplicate rows
when duplicates occur, remove them
removal in two ways
partial match removal
removal of full rows based on identical content of a few columns
The data must first be sorted so that the rows to keep come first, followed by a specification of which columns must be identical for duplicates to be removed
Complete match removal
Removal of full rows based on identical content in all columns
this is a special case of partial removal
this means that for a row to be removed, all values in all columns must match the same values in a prior row
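The two kinds of removal above can be sketched in pandas. This is a minimal illustration, not the chapter's own tool: the table, its column names (customer_id, visit_date, amount), and the choice to keep the latest visit are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical data: customer 1 appears twice with different visits,
# and customer 2 appears twice with identical values in every column.
rows = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "visit_date": ["2024-01-05", "2024-03-10", "2024-02-01", "2024-02-01", "2024-01-20"],
    "amount": [10, 25, 40, 40, 15],
})

# Partial match removal: sort so the row to keep comes first
# (here, the latest visit), then drop rows that repeat customer_id.
partial = (
    rows.sort_values("visit_date", ascending=False)
        .drop_duplicates(subset=["customer_id"], keep="first")
        .sort_values("customer_id")
        .reset_index(drop=True)
)

# Complete match removal: a row is dropped only if every column
# matches a prior row.
complete = rows.drop_duplicates(keep="first").reset_index(drop=True)

print(partial)
print(complete)
```

With the sort placed before drop_duplicates, the keep="first" argument is what ties the sort order to the row that survives.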
12.2 Filtering
an often necessary and convenient tool for splitting a set of data into two separate tables based on characteristics of that data
only the training set can be used to train machine learning models
uses outside of train files
often there is a set of data transformations that apply only to a subset of the available data
in this instance, the data can be filtered into two separate tables
a union can be applied first
it combines both the training and test data
then a filter can be used to place those passengers with an age indicated in the data into a new training set
data modification = feature engineering
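The union-then-filter step described above might look like this in pandas. It is a sketch under assumptions: the two small tables and the column names (passenger_id, age, survived) are invented for illustration, and pd.concat stands in for the union.

```python
import pandas as pd

# Hypothetical training and test tables; some ages are missing.
train = pd.DataFrame({
    "passenger_id": [1, 2, 3],
    "age": [22.0, None, 35.0],
    "survived": [0, 1, 1],
})
test = pd.DataFrame({
    "passenger_id": [4, 5],
    "age": [None, 58.0],
})

# Union: stack both tables into one (test rows get NaN for survived).
combined = pd.concat([train, test], ignore_index=True)

# Filter: split into two tables based on whether age is present.
has_age = combined["age"].notna()
age_known = combined[has_age]      # passengers with an age indicated
age_missing = combined[~has_age]   # passengers without an age

print(age_known["passenger_id"].tolist())
print(age_missing["passenger_id"].tolist())
```

The same boolean mask produces both tables, so every row lands in exactly one of them.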
12.3 Sampling
generally used to select smaller datasets that mirror the characteristics of a population
in machine learning it is used both to create datasets for building models and to create datasets for evaluating them, to ensure that ML findings generalize to contemporary data and can predict future behaviors and events
benefit
very large datasets
unbalanced dataset
occurs when one value is underrepresented relative to the other in what are called binary targets
common in ML
EX: training datasets whose target values indicate whether someone is afflicted by a specific disease or whether someone clicked on or bought a specific product advertised to them
the first set of data extracted is the holdback sample
this will be used for the final evaluation of the models produced
unique identifier
a feature used to predict the target
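Extracting a holdback sample from an unbalanced dataset can be sketched as follows. The 20% holdback size, the fixed random seed, and the column names (id, target) are assumptions chosen for the example, not values given in the notes.

```python
import pandas as pd

# Hypothetical unbalanced data: 10 positive rows out of 100.
data = pd.DataFrame({
    "id": range(1, 101),
    "target": [1] * 10 + [0] * 90,
})

# Share of the underrepresented value in the binary target.
positive_share = data["target"].mean()

# Extract the holdback sample first (assumed 20% here); it is set
# aside for the final evaluation of the models produced.
holdback = data.sample(frac=0.2, random_state=42)

# Everything not held back remains available for building models.
remaining = data.drop(holdback.index)

print(positive_share, len(holdback), len(remaining))
```

Because the draw is purely random, the holdback sample may contain very few rows of the rare value when the data is small; checking the target share in each part is a reasonable follow-up.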