Chapter 12. Data Reduction and Splitting
Splitting Rows
Unique rows
Remove duplicates
Partial match removal method
Remove full rows if a few columns are the same
Sort the rows first so that the row to keep comes first within each group of partial duplicates
EX: An app logs every geolocation a user opens it from. Keep only the first entry for each day; that location is probably the user's home address
Sort every location into buckets. The bucket with the most hits is most likely home.
This grouping happens with the Summarize function.
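The partial match idea above can be sketched in plain Python. This is only an illustration, not the Summarize function itself: the row layout (user, date, hour, location bucket) and the helper names `first_per_day` and `likely_home` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical log rows: (user, date, hour, location bucket)
rows = [
    ("ann", "2024-01-01", 7, "A"),
    ("ann", "2024-01-01", 12, "B"),
    ("ann", "2024-01-02", 8, "A"),
    ("ann", "2024-01-02", 18, "C"),
    ("ann", "2024-01-03", 9, "B"),
]

def first_per_day(rows):
    """Keep one row per (user, date): the earliest one by hour."""
    kept = {}
    # Sort so the row to keep comes first within each (user, date) group.
    for row in sorted(rows, key=lambda r: (r[0], r[1], r[2])):
        key = (row[0], row[1])     # partial match: only user and date are compared
        kept.setdefault(key, row)  # the first row seen for a key wins
    return list(kept.values())

def likely_home(rows):
    """Count hits per location bucket and return the top bucket per user."""
    counts = {}
    for user, _date, _hour, loc in rows:
        counts.setdefault(user, Counter())[loc] += 1
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}

daily_first = first_per_day(rows)
home = likely_home(daily_first)
```

Here the sort order decides which duplicate survives, which is why the rows are ordered before any are removed.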
Complete match removal method
Remove full rows if all columns have the same data
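A minimal sketch of complete match removal, assuming each row is a tuple of hashable values. The function name `drop_exact_duplicates` and the sample data are hypothetical.

```python
def drop_exact_duplicates(rows):
    """Remove rows whose every column matches an earlier row; keep first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:  # the whole tuple is compared, so all columns must match
            seen.add(row)
            unique.append(row)
    return unique

# Hypothetical rows: (name, age, city)
rows = [
    ("ann", 34, "NY"),
    ("bob", 41, "LA"),
    ("ann", 34, "NY"),  # exact duplicate of the first row, removed
    ("ann", 35, "NY"),  # differs in one column, kept
]
unique_rows = drop_exact_duplicates(rows)
```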
Filtering
Splitting data into 2 tables
EX: Sometimes a data transformation applies only to rows with a certain column value. Those rows are separated out and used as a training set.
Filtering can also isolate rows with missing data, such as missing ages in a datasheet
The missing age could then be predicted from the other columns within that subcategory
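As a rough sketch of this filtering step: split the table into rows with a known age and rows without, then fill each missing age from rows in the same subcategory. Using the group mean as the "prediction" is an assumption chosen for simplicity, and the column names `group` and `age` are hypothetical.

```python
from statistics import mean

# Hypothetical datasheet; None marks a missing age.
people = [
    {"name": "ann", "group": "student", "age": 20},
    {"name": "bob", "group": "student", "age": 22},
    {"name": "cat", "group": "student", "age": None},
    {"name": "dan", "group": "retired", "age": 70},
    {"name": "eve", "group": "retired", "age": None},
]

# Filter into two tables: complete rows and rows with a missing age.
known = [p for p in people if p["age"] is not None]
missing = [p for p in people if p["age"] is None]

# Mean age per subcategory, computed from the complete rows only.
ages_by_group = {}
for p in known:
    ages_by_group.setdefault(p["group"], []).append(p["age"])
group_mean = {g: mean(ages) for g, ages in ages_by_group.items()}

# Fill each missing age with its group's mean.
# (Raises KeyError if a group has no known ages at all.)
filled = [{**p, "age": group_mean[p["group"]]} for p in missing]
```

A real model could use more columns than the subcategory alone; the two-table split would stay the same.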
Sampling
Usually used to select a smaller dataset from a given group
Basically taking a sample from a population.
The sample must have enough rows to remain representative of the full dataset
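A simple random sample can be sketched with the standard library. The helper name `sample_rows`, the 10% fraction, and the fixed seed are assumptions for the example; whether a sample of that size is representative depends on the data.

```python
import random

def sample_rows(rows, fraction, seed=0):
    """Draw a simple random sample (without replacement) of about `fraction` of the rows."""
    rng = random.Random(seed)                 # fixed seed so the sample is repeatable
    k = max(1, round(len(rows) * fraction))   # always take at least one row
    return rng.sample(rows, k)

population = list(range(1000))  # stand-in for the full dataset
sample = sample_rows(population, 0.1)
```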
Holdback Sample
Usually uses 20% of the data
This data is locked away so it cannot be read when making a model
The rest of the data is split into groups (folds); 5 or 10 folds is common
The model uses the features (columns) to explain the target
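The holdback split and the folds can be sketched as follows. The function name `holdback_and_folds`, the shuffle, and the round-robin fold assignment are assumptions for illustration; the 20% holdback and 5 folds follow the notes above.

```python
import random

def holdback_and_folds(rows, holdback_fraction=0.2, n_folds=5, seed=0):
    """Set aside a holdback sample, then split the remaining rows into folds."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_hold = round(len(shuffled) * holdback_fraction)
    holdback = shuffled[:n_hold]   # locked away; not read while building the model
    rest = shuffled[n_hold:]
    # Deal the remaining rows round-robin into n_folds groups.
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return holdback, folds

holdback, folds = holdback_and_folds(list(range(100)))
```

With 100 rows this gives a holdback of 20 rows and five folds of 16 rows each, and no row appears in more than one group.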