Chapter 12: Data Reduction and Splitting
12.1 Unique Rows
data can contain duplicate rows, which should be removed
partial match removal
removal of full rows based on identical content in a few columns
ex: a business wants to determine someone's home address based on location data from their cell phone. the org's app reports the device ID, date, time, longitude, and latitude of the cell phone each time the app is used. the app data suggests hundreds of locations for users' homes because they move around. assume each user's first interaction with the app each day occurs in their home
to find an individual's home under this assumption, sort the data by device ID and then by date and time. this lists the app usage reports from beginning to end
applying a unique function based on only the device ID and date columns retains the first row from each day. this leaves one record per day, captured at a time when the app user was likely to be home
unique function: keeps only unique rows, discarding any row containing data already encountered
it is still not known with certainty which location represents the home location
using a summarize function, each unique device ID can be grouped with each unique combination of longitude and latitude into discrete buckets. the number of occurrences of each location can then easily be counted
with enough days of data and a few assumptions, the most frequent location of the cell phone is likely to be where the person lives.
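a minimal pandas sketch of this workflow, assuming illustrative column names (device_id, timestamp, lat, lon) and made-up sample values:

```python
import pandas as pd

# hypothetical app-usage log: one row per app interaction
usage = pd.DataFrame({
    "device_id": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime([
        "2023-01-01 07:10", "2023-01-01 12:30", "2023-01-02 06:55",
        "2023-01-01 08:00", "2023-01-02 07:45",
    ]),
    "lat": [40.71, 40.75, 40.71, 34.05, 34.05],
    "lon": [-74.00, -73.98, -74.00, -118.24, -118.24],
})

# sort by device and time so the first row per day is each day's earliest use
usage = usage.sort_values(["device_id", "timestamp"])
usage["date"] = usage["timestamp"].dt.date

# "unique" step: keep only the first interaction per device per day
first_per_day = usage.drop_duplicates(subset=["device_id", "date"], keep="first")

# "summarize" step: count how often each (device, lat, lon) bucket occurs;
# in practice lat/lon would be rounded into buckets before grouping
counts = (first_per_day
          .groupby(["device_id", "lat", "lon"])
          .size()
          .reset_index(name="days_seen"))

# the most frequent location per device is the likely home
likely_home = (counts.sort_values("days_seen", ascending=False)
                     .drop_duplicates("device_id"))
print(likely_home)
```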
goal: count the number of orders containing a dairy product
apply a unique function based on the OrderID column
this leaves one dairy product per order, enabling an easy count of the number of dairy orders. the remaining products are of no value and should be removed; the data does not need to be sorted by Product ID or Product Name because it does not matter which dairy row is retained for each order
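a short pandas sketch of the dairy count; the Category column and the sample order lines are assumptions for illustration, and the dairy filter here stands in for however the data was earlier limited to dairy products:

```python
import pandas as pd

# hypothetical order lines standing in for table 12.1
orders = pd.DataFrame({
    "OrderID":     [1, 1, 2, 2, 3],
    "ProductName": ["Milk", "Bread", "Cheese", "Yogurt", "Apples"],
    "Category":    ["Dairy", "Bakery", "Dairy", "Dairy", "Produce"],
})

# keep only dairy lines, then apply the unique step on OrderID so each
# order contributes at most one dairy row; no sorting is needed because
# it does not matter which dairy product is retained per order
dairy_orders = (orders[orders["Category"] == "Dairy"]
                .drop_duplicates(subset="OrderID"))

print(len(dairy_orders))  # number of orders containing at least one dairy product
```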
complete match removal
removal of full rows based on identical content in all columns
special case of partial match removal
for a row to be removed, all values in all columns must match the same values in a prior row
no rows in table 12.1 would be removed, due to the unique OrderID values
in this case, sorting the data is not required because any row that is removed will be identical to another row that is retained; it doesn't matter which of the identical rows is kept
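for contrast, a complete-match removal can be sketched as a drop_duplicates call with no column subset; the sample rows are made up:

```python
import pandas as pd

rows = pd.DataFrame({
    "OrderID":     [1, 1, 2],
    "ProductName": ["Milk", "Milk", "Milk"],
})

# with no subset given, drop_duplicates compares every column, so a row is
# dropped only when it is identical to an earlier row in all columns
deduped = rows.drop_duplicates()
print(deduped)  # the duplicate (1, "Milk") row is removed
```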
12.2 Filtering
filtering: a necessary and convenient tool for splitting a data set into two tables based on characteristics of the data
ex: as suggested earlier, it is important to union the training and test data in order to make sure changes are made uniformly to training and test rows
reason for this: non-uniform modifications to the training and test data sets will harm a model's predictive ability when it is used
once all data modifications are concluded, the training and test rows must be separated again. only the training set can be used to train machine learning models
filtering also has uses outside the training and test files
often a set of data transformations applies only to a subset of the available data. the data can be filtered into two separate tables (same columns, different rows), as in the sketch below
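a rough pandas sketch of this union, transform, and filter-back pattern; the is_train flag and the scaling step are illustrative assumptions, not the book's exact method:

```python
import pandas as pd

# hypothetical train and test frames with the same columns
train = pd.DataFrame({"x": [1, 2], "y": [0, 1]})
test  = pd.DataFrame({"x": [3, 4], "y": [None, None]})

# tag each row with its origin, union the two tables, and transform uniformly
train["is_train"] = True
test["is_train"] = False
combined = pd.concat([train, test], ignore_index=True)
combined["x_scaled"] = combined["x"] / combined["x"].max()  # example uniform change

# filter back into two tables (same columns, different rows)
train_out = combined[combined["is_train"]].drop(columns="is_train")
test_out  = combined[~combined["is_train"]].drop(columns="is_train")
```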
ex: imputation of missing data - look at Kaggle Titanic Dataset
both the training and test datasets are missing the Age value for a number of passengers
after using a Union to combine both training and test data, a filter can be used to place those passengers with an age indicated in the data into a new training set
calculate the mean age of these people and use this value as the Age for all rows in the test dataset (the rows initially missing a value in the Age column)
results of the filter are shown in tables 12.3, 12.4, and 12.5
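a minimal sketch of the mean-age imputation, assuming a simplified Titanic-style table with only PassengerId and Age:

```python
import pandas as pd

# simplified passenger table; None marks a missing Age
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age":         [22.0, None, 35.0, None],
})

# filter into two tables: rows with a known Age and rows missing it
has_age     = passengers[passengers["Age"].notna()]
missing_age = passengers[passengers["Age"].isna()].copy()

# compute the mean age from the known rows and use it for the missing rows
mean_age = has_age["Age"].mean()
missing_age["Age"] = mean_age

imputed = pd.concat([has_age, missing_age]).sort_values("PassengerId")
print(imputed)
```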
more advanced: train a machine learning model to predict the age of passengers based on all other features
apply that model to the test set of passengers without Age values
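a sketch of the model-based alternative; the feature columns and the choice of RandomForestRegressor are assumptions for illustration, since the notes only say a model is trained on all other features:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# assumed numeric features; real Titanic data would first need categorical
# columns (Sex, Embarked, etc.) encoded as numbers
passengers = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Fare":   [71.3, 7.9, 13.0, 8.1, 53.1, 21.0],
    "SibSp":  [1, 0, 0, 1, 1, 0],
    "Age":    [38.0, 22.0, None, 26.0, None, 30.0],
})

features = ["Pclass", "Fare", "SibSp"]
known   = passengers[passengers["Age"].notna()]
unknown = passengers[passengers["Age"].isna()].copy()

# train a regressor on passengers whose Age is known...
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(known[features], known["Age"])

# ...and predict Age for the passengers missing it
unknown["Age"] = model.predict(unknown[features])
```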