Chapter 12: Data Reduction and Splitting (12.1: Unique Rows (Remove…

- - - - special case of partial match removal
      - removal of full rows based on identical content in all columns
      - For a row to be removed:
        
        all of its column values must match the values in an earlier row
        
        if every order ID is unique, no two rows can match, so no rows are removed from the example
        
        it doesn't matter which row is retained: sorting is not required, because the only rows removed are identical to a row that is retained
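
The full-row case above can be sketched in pandas. This is a minimal illustration, not the chapter's own tool: the `orders` table and its column names are made up, and `drop_duplicates` stands in for the "unique rows" function.

```python
import pandas as pd

# Hypothetical order lines; the last row repeats the first in every column.
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 1],
    "product": ["milk", "bread", "milk", "milk"],
    "quantity": [2, 1, 1, 2],
})

# With no subset given, drop_duplicates compares all columns,
# so only rows that are identical everywhere are removed.
unique_rows = orders.drop_duplicates()

# Shuffling first does not change which distinct rows survive,
# which is why no sorting is needed for the full-row case.
shuffled = orders.sample(frac=1, random_state=0).drop_duplicates()
```

Both results contain the same three distinct rows, only in a different order.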
    - - removal of full rows based on identical content of a few columns
      - data must be sorted first
        
        Start: sort so that the rows to keep come first
        
        Then: specify which columns must be identical for rows to count as duplicates
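
A possible pandas sketch of partial match removal, under assumed data: the `records` table, its columns, and the "keep the most recent row" rule are all illustration-only choices.

```python
import pandas as pd

# Hypothetical customer records with several versions per customer.
records = pd.DataFrame({
    "customer_id": [7, 7, 9, 9],
    "updated": ["2024-01-05", "2024-03-01", "2024-02-10", "2024-01-20"],
    "email": ["old@example.com", "new@example.com", "b@example.com", "a@example.com"],
})

# Step 1: sort so the row to keep (here, the most recent) comes first.
# ISO-formatted date strings sort correctly as plain text.
records_sorted = records.sort_values("updated", ascending=False)

# Step 2: name the columns that must match; the first row of each
# matching group is kept and the later ones are dropped.
latest = records_sorted.drop_duplicates(subset=["customer_id"], keep="first")
```

Unlike the full-row case, the sort matters here: the rows in a matching group differ in their other columns, so the sort order decides which version survives.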
  - - - analyze data, apply unique function
        
        only keep unique rows, throw out others
        
        keep only important times of day, relevant to desired target
        
        Then: summarize function
        
        each individual (unique) ID linked with unique long/lat locations
        
        Then: apply partial match removal again to obtain the most frequent location
        
        Result: table contains only device ID and long/lat, labeled as home location
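
The home-location steps above might look like this in pandas. The `pings` data, the `hour` column, and the night-time window (22:00 to 05:00) are assumptions made for the sketch. The notes only say to keep the times of day relevant to the target.

```python
import pandas as pd

# Hypothetical location pings per device.
pings = pd.DataFrame({
    "device_id": ["a", "a", "a", "a", "a", "b", "b", "b"],
    "hour":      [23, 23, 2, 3, 14, 1, 4, 22],
    "lat":       [40.1, 40.1, 40.1, 40.7, 40.9, 51.5, 51.5, 51.2],
    "lon":       [-74.0, -74.0, -74.0, -73.9, -73.8, -0.1, -0.1, -0.3],
})

# Unique rows: drop exact duplicate pings.
unique_pings = pings.drop_duplicates()

# Keep only the times of day relevant to the target (assumed: night hours).
night = unique_pings[(unique_pings["hour"] >= 22) | (unique_pings["hour"] <= 5)]

# Summarize: count visits per device and location.
counts = (
    night.groupby(["device_id", "lat", "lon"])
    .size()
    .reset_index(name="visits")
)

# Partial match removal: sort the most visited location first,
# then keep one row per device.
home = (
    counts.sort_values("visits", ascending=False)
    .drop_duplicates(subset=["device_id"])
    [["device_id", "lat", "lon"]]
    .rename(columns={"lat": "home_lat", "lon": "home_lon"})
)
```

The result has one row per device, holding the location it visited most often in the chosen window.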
  - - - Goal: count the number of orders containing a dairy product
        
        could apply the unique function based on the order ID column
        
        Result: leaves only one dairy product per order = easy count of the number of dairy orders
        
        Since the data is not sorted by product, the retained product is arbitrary, so the product column has no value to us and should be removed
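
A small pandas sketch of the dairy count, with assumptions: the `order_lines` table and its `category` column are invented, and the filter to dairy rows is an added step that the notes do not spell out.

```python
import pandas as pd

# Hypothetical order lines: one row per product in an order.
order_lines = pd.DataFrame({
    "order_id": [101, 101, 102, 103, 103, 103, 104],
    "product": ["milk", "cheese", "bread", "yogurt", "milk", "apples", "butter"],
    "category": ["dairy", "dairy", "bakery", "dairy", "dairy", "produce", "dairy"],
})

# Assumed pre-step: keep only the dairy rows.
dairy_lines = order_lines[order_lines["category"] == "dairy"]

# Unique on order ID leaves one dairy row per order. Which product survives
# is arbitrary (no sort by product), so only the order ID column is kept.
dairy_orders = dairy_lines.drop_duplicates(subset=["order_id"])[["order_id"]]

dairy_order_count = len(dairy_orders)
```

Orders 101, 103 and 104 each contain at least one dairy product, so the count is 3.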
- - - - data then: filtered into 2 separate tables
        
        EX: when necessary - imputation of missing data
        
        use union = combine training and test data
        
        finding age: filter people with a known age into a new training set, then calculate the mean age
        
        use the mean as the value for all rows in the new test dataset
        
        test dataset = rows that were initially missing a value in the age column
        
        Or: train an ML model to predict passenger age from all other features (not including the target feature), then apply it to the test set
        
        same columns, different rows
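
The mean-age imputation described above could be sketched as follows. The passenger data, the column names, and the helper `source` column are hypothetical, `pd.concat` plays the role of the union, and only the mean-based option is shown.

```python
import pandas as pd

# Hypothetical passenger tables with the same columns but different rows.
train = pd.DataFrame({"passenger_id": [1, 2, 3], "age": [22.0, None, 30.0], "fare": [7.3, 71.3, 8.1]})
test = pd.DataFrame({"passenger_id": [4, 5], "age": [None, 38.0], "fare": [53.1, 8.5]})

# Union: stack the rows, remembering where each row came from.
combined = pd.concat(
    [train.assign(source="train"), test.assign(source="test")],
    ignore_index=True,
)

# Filter into two tables: rows with a known age and rows missing it.
has_age = combined[combined["age"].notna()]
missing_age = combined[combined["age"].isna()]

# Calculate the mean of the known ages and use it for every missing row.
mean_age = has_age["age"].mean()
filled = missing_age.assign(age=mean_age)

# Put the two tables back together in their original row order.
imputed = pd.concat([has_age, filled]).sort_index()
```

Because the mean is computed on the combined table, training and test rows are filled with the same value.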
  - - - Example: important to union training and test data - make sure changes made uniformly to training and test rows
        
        Why? non-uniform modification to test and training sets will harm model's predictive ability when being tested
        
        when data modifications (feature engineering) are done, separate the training and test rows
        
        only the training set can be used to train ML models
        
        could be: rows with a certain value in the target column are moved to the training dataset
        
        rows without that value are moved to the test set
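
One way to sketch the final split in pandas. The table, the `is_adult` feature, and the reading of "certain value" as "a known target value" are assumptions made for illustration. The notes do not say which value marks a training row.

```python
import pandas as pd

# Hypothetical combined table after the union; the target is unknown
# (missing) for the rows that came from the test data.
combined = pd.DataFrame({
    "passenger_id": [1, 2, 3, 4, 5],
    "age": [22.0, 30.0, 30.0, 30.0, 38.0],
    "survived": [0, 1, 1, None, None],
})

# Feature engineering is applied once, uniformly, to all rows.
combined["is_adult"] = combined["age"] >= 18

# Split: rows with a known target go to training, the rest to test.
train = combined[combined["survived"].notna()]
test = combined[combined["survived"].isna()].drop(columns=["survived"])
```

Both tables carry the same engineered feature, and only `train` keeps the target column used to train a model.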