Chapter 12: Data Reduction and Splitting (Sampling (Use in ML (Steps...…

- - - - Removal of full rows based on identical content of a few columns
      - To remove partial-match duplicates, first sort the rows so that the row to keep comes first, followed by the duplicates to be removed
        
        1) Using a unique function, discard any data already encountered, such as repeated IP addresses
        
        2) Using the summarize function, group each unique id with each unique config
        
        3) If necessary, repeat steps 1 and 2 until you find the specific data point you want. Ex: A person connects from multiple IPs, but you only want the home IP. Find the most frequently occurring IP and work your way toward isolating it
    - - Removal of full rows based on identical content in all columns
      - This is a special case of partial-match removal; this chapter focuses mainly on partial matches
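
The partial-match steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the `rows` log, its column names (`user`, `ip`, `ts`), and the helper `drop_partial_duplicates` are hypothetical and stand in for the unique and summarize tools mentioned in the notes.

```python
from collections import Counter

# Hypothetical connection log: one row per (user, ip, timestamp) event.
rows = [
    {"user": "ann", "ip": "10.0.0.5", "ts": 3},
    {"user": "ann", "ip": "10.0.0.5", "ts": 1},
    {"user": "ann", "ip": "172.16.0.9", "ts": 2},
    {"user": "bob", "ip": "192.168.1.2", "ts": 4},
    {"user": "bob", "ip": "192.168.1.2", "ts": 5},
]


def drop_partial_duplicates(rows, key_cols, sort_key):
    """Keep the first row (after sorting) for each combination of key_cols."""
    seen = set()
    kept = []
    for row in sorted(rows, key=sort_key):
        key = tuple(row[c] for c in key_cols)
        if key not in seen:  # discard data already encountered
            seen.add(key)
            kept.append(row)
    return kept


# Step 1: sort by timestamp so the earliest row is kept per (user, ip).
unique_pairs = drop_partial_duplicates(rows, ["user", "ip"], lambda r: r["ts"])

# Step 2: summarize by counting how often each user appears on each IP.
counts = Counter((r["user"], r["ip"]) for r in rows)

# Step 3: isolate each user's most frequent IP (ties go to the first seen).
home_ip = {}
for (user, ip), n in counts.items():
    if user not in home_ip or n > counts[(user, home_ip[user])]:
        home_ip[user] = ip

# Full-row removal is the special case where every column is a key column.
full_unique = drop_partial_duplicates(rows, ["user", "ip", "ts"], lambda r: r["ts"])
```

Sorting before the uniqueness pass is what decides which row survives, so the sort key should encode the "row to keep" rule (here, the earliest timestamp).
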
- - - - Ensures that ML findings are generalizable to contemporary data and capable of predicting future events
    - - Occurs when one value is underrepresented relative to the other.
      - To fix...
        
        1) Apply the filter tool
        2) Create two tables
        3) Downsample the majority class
        4) Join the tables back together with a union
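
A minimal sketch of the four rebalancing steps, using only the Python standard library. The `rows` data, the `label` column, and the choice to downsample to a 1:1 ratio are assumptions made for illustration; the notes do not specify a target ratio.

```python
import random

# Hypothetical labelled rows: 8 of the majority class (0), 2 of the minority (1).
rows = [{"id": i, "label": 0} for i in range(8)] + [
    {"id": i, "label": 1} for i in range(8, 10)
]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Steps 1-2: filter the data into two tables, one per class.
majority = [r for r in rows if r["label"] == 0]
minority = [r for r in rows if r["label"] == 1]

# Step 3: downsample the majority class (assumed target: minority size).
majority_down = rng.sample(majority, k=len(minority))

# Step 4: union the two tables back together.
balanced = majority_down + minority
```
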
    - - 1) Randomize the order of the data and select a holdout sample. Split the remaining data into folds
      - 2) Ignore the holdout sample and focus on the remaining folds. The validation score is the number of correct predictions (this step requires a lot of processing power)
      - 3-6) Use cross-validation to estimate the accuracy score on the data
      - 7) Calculate the overall accuracy across all data
      - 8) With feedback, make improvements to the data set. Repeat steps 2-6 until the results are deemed accurate enough
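
The holdout and fold procedure above might look like the following sketch. Everything specific here is an assumption for illustration: the synthetic `data`, the 20-row holdout, `k = 4` folds, and the toy threshold "model" in `fit_threshold`, which stands in for whatever ML model is actually being evaluated.

```python
import random

# Hypothetical data set: one numeric feature x and a binary label y.
data = [(x, int(x >= 50)) for x in range(100)]

rng = random.Random(0)
rng.shuffle(data)  # step 1: randomize the order

holdout, rest = data[:20], data[20:]  # step 1: set the holdout sample aside
k = 4
folds = [rest[i::k] for i in range(k)]  # step 1: split the remainder into folds


def fit_threshold(train):
    """Toy 'model': midpoint between the class means of x (illustrative only)."""
    mean0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    mean1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    return (mean0 + mean1) / 2


def accuracy(threshold, rows):
    """Fraction of rows whose label is predicted correctly."""
    correct = sum(1 for x, y in rows if int(x >= threshold) == y)
    return correct / len(rows)


# Steps 2-6: each fold takes one turn as the validation set.
scores = []
for i in range(k):
    train = [row for j, fold in enumerate(folds) if j != i for row in fold]
    scores.append(accuracy(fit_threshold(train), folds[i]))

cv_accuracy = sum(scores) / k  # step 7: overall cross-validated accuracy

# Final check on the untouched holdout sample, using a model fit on all folds.
holdout_accuracy = accuracy(fit_threshold(rest), holdout)
```

Step 8 is not shown: in practice the data set or model would be adjusted and the fold loop rerun, while the holdout sample stays untouched until the end so that it remains an honest estimate.
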