Chapter 12 - Data Reduction and Splitting
Unique rows
duplicate rows need to be removed from a data set
1: Partial match removal
remove full rows based on identical content of a few columns
first sort the data so that the rows to keep appear first, then specify which columns must be identical for a row to count as a duplicate and be removed
"unique function" keeps only unique rows, discards and row containing data already seen
"summarize function" each unique device id can be grouped with each unique configuration of longitude and latitude into discreet buckets
another round of this may be required after sorting by device id and count (descending) of call location, to retain only the most frequent location of each device (see the sketch below)
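A minimal pandas sketch of partial match removal, assuming a hypothetical call-log table with device_id, longitude, and latitude columns (the data and column names are illustrative, not from the source):

```python
import pandas as pd

# Hypothetical call-log data; values are illustrative only.
calls = pd.DataFrame({
    "device_id": ["A", "A", "A", "B", "B"],
    "longitude": [-71.06, -71.06, -71.10, -71.20, -71.20],
    "latitude":  [42.36, 42.36, 42.38, 42.40, 42.40],
})

# "summarize": count how often each device appears at each location bucket.
location_counts = (
    calls.groupby(["device_id", "longitude", "latitude"])
         .size()
         .reset_index(name="count")
)

# Sort so the row to keep (most frequent location per device) comes first,
# then do partial match removal on device_id only.
most_frequent = (
    location_counts.sort_values(["device_id", "count"], ascending=[True, False])
                   .drop_duplicates(subset=["device_id"], keep="first")
)
print(most_frequent)
```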
2: Complete match removal
remove full rows based on identical content of all columns
special case of partial match removal
means that for a row to be removed, all values in all columns must match the same values in a prior row (as sketched below)
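A complete match removal sketch in pandas, again with made-up data; with no subset of columns specified, every column must match a prior row for the duplicate to be dropped:

```python
import pandas as pd

rows = pd.DataFrame({
    "device_id": ["A", "A", "B"],
    "longitude": [-71.06, -71.06, -71.20],
    "latitude":  [42.36, 42.36, 42.40],
})

# Complete match removal: a row is dropped only if ALL column values
# match a row already seen (the special case of partial match removal
# where the subset is every column).
deduplicated = rows.drop_duplicates(keep="first")
print(deduplicated)
```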
Filtering
often necessary and convenient for splitting up data into two separate tables based on characteristics of the data
used for train and test files
ensures changes are made uniformly to training and test rows
non-uniform modifications will harm a model's predictive ability when it is tested
often there is a set of data transformations that apply only to a subset of available data
data can be filtered into two separate tables (same columns, diff rows)
can use a "union" to combine training and test data
"filter" to split the table
Sampling
generally used to select smaller datasets (samples) mirroring the characteristics of a population
in machine learning, sampling is used to create the datasets that build models and the datasets that evaluate them, ensuring that ML findings are generalizable to contemporary data and capable of predicting future behaviors and events
benefit: for very large datasets, sampling preserves processing power and reduces time spent in analysis (a small sketch follows)
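A minimal sampling sketch in pandas; the 10% fraction and random_state are arbitrary choices for illustration:

```python
import pandas as pd

# Pretend this is a very large dataset (illustrative data).
population = pd.DataFrame({"feature": range(1_000_000)})

# Draw a 10% random sample that should mirror the population's
# characteristics while cutting processing time.
sample = population.sample(frac=0.10, random_state=42)
print(len(sample))  # ~100,000 rows
```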
learning curves are used in ML to understand how much data is enough (see the sketch below)
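A sketch of a learning curve with scikit-learn, used to judge how much data is enough; the logistic regression model and synthetic data are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary-target data stands in for a real dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Score the model on increasing amounts of training data; when the
# validation score flattens, adding more data no longer helps much.
train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0],
)
print(train_sizes)
print(valid_scores.mean(axis=1))
```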
the data set can be unbalanced
happens when one value is underrepresented relative to the other in what are called binary targets
binary targets
two target values, such as "yes" and "no"
common in ML
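A sketch for checking and rebalancing a binary target, assuming a hypothetical target column; downsampling the majority class is one common remedy, not necessarily the one the chapter uses:

```python
import pandas as pd

# Hypothetical data with an underrepresented "yes" class.
df = pd.DataFrame({"target": ["no"] * 90 + ["yes"] * 10})
print(df["target"].value_counts())  # no: 90, yes: 10 -> unbalanced

# Downsample the majority class so both target values appear equally often.
minority = df[df["target"] == "yes"]
majority = df[df["target"] == "no"].sample(n=len(minority), random_state=0)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)
print(balanced["target"].value_counts())
```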
holdback sample
first set of data extracted; used for the final evaluation of the model(s) produced
before the holdback is released and used for that final evaluation, other samples will be used for validating models (see the holdout sketch below)
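A minimal sketch of carving out a holdback (holdout) sample with scikit-learn; the 20% size is an assumed convention, not taken from the chapter:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Randomize order and set aside a holdback sample for the FINAL
# evaluation only; everything else is used for training/validation.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=0
)
print(len(X_rest), len(X_holdout))  # 800 200
```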
STEPS for validation, cross validation, and holdout sample creation
1: randomize the order of the data and select a holdout sample
2: ignore the holdback sample for now and focus on the remaining five folds; the model is employed to predict the actual values in the target column of fold 1; once predictions are complete, the ML system compares the true target values in fold 1 to the predicted target values and calculates a variety of success metrics ("accuracy" checks how many true zeros the model predicted as 0 values and how many true ones it predicted as 1 values; if it is correct in 3/4 of cases, this will later be known as the validation score)
3: happens only if cross validation is needed and valuable for a model; the validation sample moves down to fold 2
4: validation sample is moved down one more step and is now fold 3
5: validation sample moved down to fold 4
6: validation sample moved to fold 5
7: overall accuracy for cross validation is calculated; every row in the non-holdout training set is used five times, four times to construct a model and once to evaluate another model
*: with feedback from validation/cross validation, improvements can be made to the data set (see the cross-validation sketch below)
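A hedged sketch of five-fold cross validation with scikit-learn, mirroring the steps above; the model choice and accuracy metric are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1200, random_state=0)

# Step 1: randomize and set aside the holdout sample.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=1 / 6, shuffle=True, random_state=0
)

# Steps 2-6: each of the five folds takes a turn as the validation fold.
fold_scores = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])
    preds = model.predict(X_train[valid_idx])
    fold_scores.append(accuracy_score(y_train[valid_idx], preds))

# Step 7: overall cross-validation accuracy; every non-holdout row is used
# five times (four times to build models, once to evaluate one).
print(sum(fold_scores) / len(fold_scores))
```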
Exercises
difference between partial match removal and complete match removal
why do we cross validate?
Model Data
this section is about modeling data, or using data to train algorithms to create models that can predict future events and understand past ones
this data is extracted, given a target (T), and prepared to submit to the "model data" stage; a model is then produced, examined, and placed into production
once the model is in a production system, data from the real world can be input into the system (see the sketch below)
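A minimal sketch of the model data flow described above, using scikit-learn with synthetic data and a logistic regression model as stand-ins (neither is specified by the source):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Extracted data with a target (T).
X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# "Model data" stage: produce and examine a model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))

# In production: new real-world rows are fed to the model for predictions.
new_rows = X_test[:3]  # stand-in for incoming production data
print(model.predict(new_rows))
```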
DataRobot
fulfills almost every one of the criteria for AutoML excellence
performs best on AutoML criteria
is a "living tool"; constantly updated and improved