Chapter 12: Data Reduction and Splitting
Unique Rows
Datasets often contain duplicate rows, which need to be removed before analysis
Partial match removal
Removal of full rows based on identical content in a few columns
To perform a partial match, first sort the rows so that the row to keep comes first, followed by the duplicates to be removed
1) Using a unique function, discard any data already encountered, such as repeated IP addresses
2) Using the summarize function, group each unique id with each unique config
3) If necessary, repeat steps 1 and 2 until you isolate the specific data point you want. Ex: A person connects from multiple IPs, but you only want the home IP. Find the most frequently occurring IP and work your way toward isolating it
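The home-IP example can be sketched in plain Python. This is a minimal illustration, not the chapter's tool workflow: the connection log, the user names, and the helper keep_first_by_key are all hypothetical, and Counter stands in for the summarize step.

```python
from collections import Counter

# Hypothetical connection log of (user, ip) pairs; all values are made up.
connections = [
    ("alice", "10.0.0.5"),
    ("alice", "10.0.0.5"),
    ("alice", "172.16.4.9"),
    ("bob", "192.168.1.20"),
    ("bob", "10.0.0.7"),
    ("bob", "192.168.1.20"),
]


def keep_first_by_key(rows, key):
    """Partial match removal: keep only the first row seen for each key value."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


# Summarize: count how often each (user, ip) pair occurs.
pair_counts = Counter(connections)

# Sort so the row to keep comes first: the most frequent pairs lead.
ranked = sorted(pair_counts.items(), key=lambda item: -item[1])

# Unique on the user column only: the first row per user is the most frequent IP.
first_per_user = keep_first_by_key(ranked, key=lambda item: item[0][0])
home_ip = {user: ip for (user, ip), _count in first_per_user}

print(home_ip)
```

The sort has to happen before the unique step, because the unique step always keeps whichever row it meets first.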
Complete match removal
Removal of full rows based on identical content in all columns
A special case of partial match removal; this chapter focuses mainly on partial match
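One way to see why complete match is a special case: a single function that compares a chosen subset of columns becomes complete match removal when the subset is every column. The function drop_duplicates and the sample rows below are hypothetical, written only for this sketch.

```python
def drop_duplicates(rows, key_columns=None):
    """Keep the first row for each distinct combination of key_columns.

    With key_columns=None, every column is compared (complete match).
    """
    seen = set()
    kept = []
    for row in rows:
        columns = key_columns if key_columns is not None else sorted(row)
        key = tuple((c, row[c]) for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical rows for illustration.
rows = [
    {"id": 1, "config": "a", "ip": "10.0.0.5"},
    {"id": 1, "config": "a", "ip": "10.0.0.5"},  # identical in every column
    {"id": 1, "config": "a", "ip": "10.0.0.9"},  # identical in id and config only
    {"id": 2, "config": "b", "ip": "10.0.0.5"},
]

complete = drop_duplicates(rows)                   # compares all columns
partial = drop_duplicates(rows, ["id", "config"])  # compares two columns

print(len(complete), len(partial))
```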
Filtering
Often necessary and convenient for splitting up data into two separate tables
Can use filtering to examine a column (like quantity type) in further detail
Only the training set can be used to train ML models
We use filtering to split the data into pieces that are easier to work with in code. Ex: A filter on the string "kg" also picks up "pkgs". To fix this, we split the result again
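The kg/pkgs problem can be reproduced with two rounds of filtering. The quantity strings below are invented for illustration; each filter splits one table into two.

```python
# Hypothetical quantity-type values.
quantities = ["5 kg", "2 pkgs", "12 kg", "1 pkg", "3 boxes"]

# First split: rows that contain "kg" versus rows that do not.
# The substring "kg" also appears inside "pkg" and "pkgs".
contains_kg = [q for q in quantities if "kg" in q]
no_kg = [q for q in quantities if "kg" not in q]

# Second split: separate the package rows from the true kilogram rows.
packages = [q for q in contains_kg if "pkg" in q]
kilograms = [q for q in contains_kg if "pkg" not in q]

print(kilograms)
print(packages)
print(no_kg)
```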
Sampling
Generally used to select smaller datasets that mirror the characteristics of a population
Use in ML
Used to create the datasets that will be used to build and test models
Ensures that ML findings are generalizable to contemporary data and capable of predicting future events
Commonly unbalanced
Occurs when one value is underrepresented relative to the other.
To fix...
1) Apply filter tool
2) Create two tables
3) Downsample the majority class
4) Join back together with a union
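The four balancing steps can be sketched as follows. The labels "common" and "rare", the 90/10 split, and the fixed seed are assumptions made for this example. The majority class is cut down to the size of the minority class, which is one common choice but not the only possible target ratio.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical unbalanced data: 90 "common" rows and 10 "rare" rows.
rows = [{"id": i, "label": "common"} for i in range(90)]
rows += [{"id": 90 + i, "label": "rare"} for i in range(10)]

# 1-2) Filter into two tables, one per class.
majority = [r for r in rows if r["label"] == "common"]
minority = [r for r in rows if r["label"] == "rare"]

# 3) Downsample the majority class (random sample without replacement).
majority_down = rng.sample(majority, k=len(minority))

# 4) Union the two tables back together and shuffle the result.
balanced = majority_down + minority
rng.shuffle(balanced)

print(len(balanced))
```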
Steps...
1) Randomize the order of the data and select a holdout sample. Split the remaining data into folds
2) Ignore the holdout sample and focus on the remaining folds. The validation score is the share of correct predictions (computing it takes a lot of processing power)
3-6) Use cross-validation to estimate the accuracy score on the data
7) Calculate the overall accuracy across all data
8) With feedback, make improvements to the data set. Repeat steps 2-6 until the results are deemed accurate enough
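A minimal sketch of the holdout and fold procedure in plain Python. Everything here is an assumption chosen for illustration: the one-feature dataset, the toy threshold "model", the 20-row holdout, and the choice of 4 folds. The step mapping in the comments is approximate.

```python
import random


def train(rows):
    """Toy model (assumption): a threshold halfway between the two class means."""
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2


def accuracy(threshold, rows):
    """Share of rows whose predicted class (x > threshold) matches the label."""
    correct = sum((x > threshold) == bool(y) for x, y in rows)
    return correct / len(rows)


rng = random.Random(42)

# Hypothetical data: one numeric feature, label 1 when the feature is >= 50.
data = [(x, int(x >= 50)) for x in range(100)]

# Step 1: randomize the order, set the holdout sample aside, split the rest into folds.
rng.shuffle(data)
holdout, working = data[:20], data[20:]
k = 4
folds = [working[i::k] for i in range(k)]

# Steps 2-6: each fold takes one turn as the validation set.
scores = []
for i in range(k):
    validation = folds[i]
    training = [row for j, fold in enumerate(folds) if j != i for row in fold]
    scores.append(accuracy(train(training), validation))

# Step 7: overall cross-validation accuracy, then a final check on the holdout sample.
cv_score = sum(scores) / k
holdout_score = accuracy(train(working), holdout)

print(scores, cv_score, holdout_score)
```

The holdout rows are never used for training or validation inside the loop, so the final holdout score is measured on data the model has not seen.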
Benefits
For very large datasets, it saves processing power and time spent on analysis
Used alongside learning curves for ML to understand how much data is enough
Exercises
What is the difference between partial match removal and complete match removal?
Partial match removal deletes full rows based on identical content in a few columns. Complete match removal deletes full rows based on identical content in all columns
Why do we cross-validate?
We use it to estimate how accurately a model will perform on real-world data
Steven Chesney
Email:
stch1109@colorado.edu
Class: 3201-002
Prof: Larsen