Chapter 12: Data Reduction and Splitting (12.2: Filtering (Common uses,…

- - - - Make sure changes are made uniformly to training and test rows
      - In order to retain a model’s predictive ability
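One way to keep changes uniform is to put every cleaning step in a single function and run both sets of rows through it. The sketch below is only illustrative: the column names (`weight`, `unit`), the unit table, and the "drop non-positive weights" rule are assumptions, not taken from the chapter.

```python
def to_grams(row):
    """Return a new row with the weight expressed in grams (hypothetical columns)."""
    factor = {"g": 1, "kg": 1000}[row["unit"]]
    return {"weight_g": row["weight"] * factor}


def clean(rows):
    """Shared cleaning step: convert units, then drop non-positive weights."""
    converted = [to_grams(r) for r in rows]
    return [r for r in converted if r["weight_g"] > 0]


# Invented example rows, for illustration only.
train_rows = [{"weight": 2, "unit": "kg"}, {"weight": 0, "unit": "g"}]
test_rows = [{"weight": 750, "unit": "g"}, {"weight": 1, "unit": "kg"}]

# The same function is applied to both sets, so they cannot drift apart.
train_clean = clean(train_rows)
test_clean = clean(test_rows)
```

Because both sets go through `clean`, any later change to the rule reaches training and test rows at the same time.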
  - - - However, this criterion is flawed: it also picks up the “kg” contained in “pkgs”
      - Can now split the rows based on whether the column contains a “g” (grams) measurement
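The “kg” inside “pkgs” problem is easy to reproduce. The sketch below uses invented unit strings and assumes the filter is a plain substring test. One possible fix (an assumption, not necessarily the chapter's) is a regular expression that rejects “kg” when a letter comes immediately before it.

```python
import re

# Hypothetical values of a size column (illustrative only).
sizes = ["2 kg", "500 g", "3 pkgs", "1.5kg", "12 pkgs"]

# Naive criterion: substring test. "pkgs" contains "kg", so it matches too.
naive_kg = [s for s in sizes if "kg" in s]

# Stricter criterion: "kg" must not be preceded by a letter
# and must end at a word boundary.
KG_PATTERN = re.compile(r"(?<![A-Za-z])kg\b")
strict_kg = [s for s in sizes if KG_PATTERN.search(s)]
```

Here `naive_kg` still includes the “pkgs” rows, while `strict_kg` keeps only the genuine kilogram entries.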
- - - - Removal of full rows based on identical content of a few columns
      - Data must first be sorted
        
        Sort so that the rows to KEEP are listed first
        
        Specify which columns should be identical (so that the appropriate duplicates are deleted)
    - - Removal of full rows based on identical content in all columns
      - Special case of partial match removal
      - All values in all columns must match the same values in a prior row in order for that row to be removed
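Both kinds of duplicate removal can be sketched with one helper. This is a minimal illustration in plain Python, not the tool's own function: `drop_duplicates`, its `key_columns` parameter, and the sample rows are all hypothetical. Passing no key columns compares every column, which makes full match removal a special case of partial match removal.

```python
def drop_duplicates(rows, key_columns=None):
    """Keep the first row for each distinct key; drop later matching rows.

    key_columns=None compares all columns (full match removal).
    """
    seen = set()
    kept = []
    for row in rows:
        columns = sorted(row) if key_columns is None else key_columns
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Invented example rows, for illustration only.
rows = [
    {"customer": "A", "date": "2020-01-02", "amount": 10},
    {"customer": "A", "date": "2020-01-01", "amount": 10},
    {"customer": "B", "date": "2020-01-03", "amount": 7},
    {"customer": "A", "date": "2020-01-01", "amount": 10},
]

# Sort first so the rows to KEEP come first (here: oldest date first).
rows_sorted = sorted(rows, key=lambda r: r["date"])

# Partial match: rows are duplicates if "customer" alone is identical.
partial = drop_duplicates(rows_sorted, key_columns=["customer"])

# Full match: rows are duplicates only if every column is identical.
full = drop_duplicates(rows_sorted)
```

The sort matters only for the partial match: it decides which of the matching rows survives.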
  - - - The first interaction of each day is assumed to be the home interaction
      - Sort data by date and time (want the oldest row)
      - Select unique function
        
        Keeps only unique rows, discarding any row containing data already encountered
      - Retain only the first row from each day
        
        Leaves us with one record from each day
      - Select the summarize function
        
        Can count the number of occurrences of each location per DeviceID; the location with the greatest number of occurrences is most likely that user’s actual home
        
        May have to use partial match removal to retain only the most frequent location
      - The resulting table, containing DeviceID, longitude, and latitude, can be labelled as the home location
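The home-location steps above can be sketched end to end in plain Python. The interaction log is invented, and the sketch assumes that timestamps are ISO 8601 strings (so they sort chronologically) and that repeat visits to one place report exactly the same coordinates. Real data would probably need the coordinates rounded first.

```python
from collections import Counter

# Hypothetical log rows: (DeviceID, ISO timestamp, longitude, latitude).
interactions = [
    ("dev1", "2021-03-01T07:10:00", -71.06, 42.36),
    ("dev1", "2021-03-01T12:30:00", -71.10, 42.40),
    ("dev1", "2021-03-02T06:55:00", -71.06, 42.36),
    ("dev1", "2021-03-03T08:05:00", -71.20, 42.30),
    ("dev2", "2021-03-01T09:00:00", -70.90, 42.50),
    ("dev2", "2021-03-02T09:15:00", -70.90, 42.50),
]

# Step 1: sort by date and time so the oldest row of each day comes first.
ordered = sorted(interactions, key=lambda r: r[1])

# Step 2: retain only the first row per device per day.
first_of_day = {}
for device, timestamp, lon, lat in ordered:
    day = timestamp[:10]  # "YYYY-MM-DD" part of the timestamp
    first_of_day.setdefault((device, day), (lon, lat))

# Step 3: summarize, counting how often each location occurs per device.
counts = {}
for (device, _day), location in first_of_day.items():
    counts.setdefault(device, Counter())[location] += 1

# Step 4: keep the most frequent location per device as the home location.
home = {device: c.most_common(1)[0][0] for device, c in counts.items()}
```

If two locations tie, `most_common` returns the one that was counted first, so a real analysis would need an explicit tie-breaking rule.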