Chapter 12. Data reduction and splitting (12.3 sampling (unbalanced…

- - - - duplicate removal in two ways
        
        partial match removal
        
        removal of full rows based on identical content of a few columns
        
        The table must first be sorted so that the rows to keep come first, followed by a specification of which columns must be identical for duplicates to be removed
        
        Complete match removal
        
        Removal of full rows based on identical content in all columns
        
        this is a special case of partial removal
        
        this means that for a row to be removed, all values in all columns must match the values in a prior row
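
A minimal sketch of the two kinds of duplicate removal described above, in plain Python. The source names no tool, so the function `remove_duplicates`, its parameters, and the sample columns (`customer`, `city`, `visits`) are all hypothetical and chosen only for illustration. Complete match removal is treated as the special case where every column is compared.

```python
def remove_duplicates(rows, key_columns=None, sort_key=None):
    """Return rows with duplicates removed, keeping the first occurrence.

    key_columns: columns that must match for a row to count as a duplicate;
                 None means all columns (complete match removal).
    sort_key:    optional function that orders rows so the ones to keep come first.
    """
    if sort_key is not None:
        rows = sorted(rows, key=sort_key)
    seen = set()
    kept = []
    for row in rows:
        columns = key_columns if key_columns is not None else sorted(row)
        signature = tuple(row[c] for c in columns)
        if signature not in seen:
            seen.add(signature)
            kept.append(row)
    return kept


# Hypothetical sample table.
rows = [
    {"customer": "ann", "city": "Oslo", "visits": 3},
    {"customer": "ann", "city": "Oslo", "visits": 3},
    {"customer": "bob", "city": "Rome", "visits": 5},
    {"customer": "bob", "city": "Rome", "visits": 9},
]

# Complete match removal: only the exact repeat of ann's row is dropped.
complete = remove_duplicates(rows)

# Partial match removal: sort so the highest visit count comes first,
# then treat rows with the same customer and city as duplicates.
partial = remove_duplicates(
    rows,
    key_columns=["customer", "city"],
    sort_key=lambda r: -r["visits"],
)
```

Sorting before the partial removal is what decides which of the matching rows survives; here the row with the most visits is kept for each customer and city pair.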
  - - - only the training set can be used to train machine learning models
    - - often there is a set of data transformations that apply only to a subset of the available data
        
        in this instance, the data can be filtered into two separate tables
    - - you can combine both training and test data
        
        then a filter can be used to place those passengers with an age indicated in the data into a new training set
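
The combine-then-filter step above can be sketched in plain Python as follows. The passenger records, the `age` field name, and the helper `split_by_known_age` are assumptions made for illustration; the source does not specify a tool or a schema.

```python
def split_by_known_age(passengers):
    """Split passengers into two tables: age present and age missing."""
    known, missing = [], []
    for passenger in passengers:
        if passenger.get("age") is not None:
            known.append(passenger)
        else:
            missing.append(passenger)
    return known, missing


# Hypothetical original training and test tables.
training = [{"name": "A", "age": 22.0}, {"name": "B", "age": None}]
test = [{"name": "C", "age": 35.0}, {"name": "D", "age": None}]

# Combine both tables, then filter on whether an age is indicated.
combined = training + test
age_known, age_missing = split_by_known_age(combined)
```

Here `age_known` plays the role of the new training set, and `age_missing` holds the remaining rows as the second table.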
- - - - EX: training datasets whose target values indicate whether someone is afflicted by a specific disease or whether someone clicked on or bought a specific product advertised to them