ch. 12 data reduction and splitting
12.1 unique rows
two ways to remove duplicates:
1. partial match removal: removal of full rows based on identical content in a few columns
2. complete match removal: removal of full rows based on identical content in all columns
Partial match: data must first be sorted so that the rows to keep are listed first, followed by specification of which columns must be identical for duplicates to be removed
ex. home address of a cell phone user. to find a user's home address, sort data by device id and then by date and time. then select only the device id and date columns and apply a unique function (keeps only unique rows, discarding rows already encountered)
then, with a summarize function, each unique device id can be grouped with each unique config of longitude and latitude into buckets
complete match removal: special case of partial match removal. for a row to be removed, all values in all columns must match the same values in a prior row.
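A minimal sketch of both kinds of duplicate removal, using only the Python standard library. The rows, the column order, and the helper name `unique_rows` are hypothetical, made up for illustration. The helper remembers every key it has seen, so it also catches duplicates that are not adjacent.

```python
# Hypothetical location pings: (device_id, date, time, latitude, longitude).
rows = [
    ("dev-2", "2024-01-01", "23:10", 40.01, -105.27),
    ("dev-1", "2024-01-01", "02:15", 39.74, -104.99),
    ("dev-1", "2024-01-01", "02:15", 39.74, -104.99),  # exact duplicate
    ("dev-1", "2024-01-01", "14:30", 39.75, -105.00),
    ("dev-1", "2024-01-02", "01:05", 39.74, -104.99),
]


def unique_rows(table, key_columns=None):
    """Keep the first row seen for each key.

    key_columns=None compares all columns (complete match);
    otherwise only the listed column positions are compared (partial match).
    """
    seen = set()
    kept = []
    for row in table:
        key = row if key_columns is None else tuple(row[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Partial match: sort first so the row to keep comes first
# (device id, then date, then time), then compare device id and date only.
ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
partial = unique_rows(ordered, key_columns=(0, 1))

# Complete match: a row is dropped only if every column matches an earlier row.
complete = unique_rows(rows)
```

Here `partial` keeps the earliest ping per device per day, and `complete` drops only the one exact duplicate.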
Filtering: often a necessary and convenient tool for splitting a set of data into two separate tables based on characteristics of that data
ex. moving rows with a certain value in a target column into a test set vs. a train set. It is important to union training and test data to make sure changes are made to both sets of rows. Once data mods are concluded, training and test rows need to be separated again.
ex. in Table 3, the kg flaw still contains useful info because the remaining table only contains cans and bottles measured in oz. Split Table 3 based on whether QuantityPerUnit contains a grams measurement or not
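Splitting one table into two by a condition can be sketched as below. Only the column name QuantityPerUnit comes from the notes. The product rows, the helper names, and the rule for spotting a grams measurement (a standalone "g" token) are assumptions made for illustration.

```python
# Hypothetical product rows; only the QuantityPerUnit column name is from the notes.
products = [
    {"ProductName": "Cola", "QuantityPerUnit": "24 - 12 oz cans"},
    {"ProductName": "Jam", "QuantityPerUnit": "12 - 200 g jars"},
    {"ProductName": "Ale", "QuantityPerUnit": "24 - 12 oz bottles"},
    {"ProductName": "Flour", "QuantityPerUnit": "10 - 500 g pkgs."},
]


def split_table(table, condition):
    """Return (rows where condition is true, all remaining rows)."""
    matched, remainder = [], []
    for row in table:
        (matched if condition(row) else remainder).append(row)
    return matched, remainder


def measured_in_grams(row):
    # Assumption: a grams measurement shows up as a standalone "g" token.
    return "g" in row["QuantityPerUnit"].split()


grams_rows, other_rows = split_table(products, measured_in_grams)
```

Every row lands in exactly one of the two tables, so they can be unioned again later without losing or duplicating rows.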
12.3 sampling
sampling used to select samples that mirror characteristics of a population. In ML, sampling is used both to create datasets to build models and datasets to evaluate models (ML can predict future behaviors and events: customers, clients, website visitors, weather, etc.)
benefit of sampling in large datasets is saving processing power and time spent analyzing,
as long as enough rows are retained for effective prediction making
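A simple random sample without replacement is one way to cut a large table down. The sketch below assumes a made-up population of 1000 row ids and a sample size of 100. The helper name `sample_rows` and the fixed seed are illustrative choices, not something the notes prescribe.

```python
import random


def sample_rows(table, n, seed=0):
    """Draw n rows without replacement; a fixed seed keeps the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(table, n)


# Hypothetical population of 1000 row ids.
population = list(range(1000))
subset = sample_rows(population, 100)
```

Because every row has the same chance of being picked, a large enough sample tends to mirror the population, though a small one may not.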
unbalanced datasets are common in ML: one value is underrepresented relative to the other in a binary target (yes/no values)
apply filter tool to create two tables, one for each class of the target, before downsampling
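The filter-then-downsample idea might look like the sketch below. The data, the column name `churned`, and the choice to shrink the larger class all the way down to the size of the smaller one are assumptions. The notes do not say what ratio to aim for.

```python
import random


def downsample(table, target, seed=0):
    """Balance a yes/no target by randomly shrinking the larger class."""
    # Filter step: one table per class of the binary target.
    positives = [r for r in table if r[target] == "yes"]
    negatives = [r for r in table if r[target] == "no"]
    minority, majority = sorted([positives, negatives], key=len)

    # Assumption: reduce the majority class to the minority class size.
    rng = random.Random(seed)
    reduced = rng.sample(majority, len(minority))

    balanced = minority + reduced
    rng.shuffle(balanced)  # avoid leaving the classes in two blocks
    return balanced


# Hypothetical unbalanced data: 10 "yes" rows and 90 "no" rows.
data = [{"id": i, "churned": "yes" if i < 10 else "no"} for i in range(100)]
balanced = downsample(data, "churned")
```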
Step 0: randomize data and pick holdout sample and folds (groups)
S1: build model with combo of folds 2+3+4+5, validate on fold 1. Once the model is prepared, it is given access to all data in fold 1 and employed to predict the actual value in the target column. Validation score (explained later): accuracy measure used to see how many of the true zeros, as well as true ones, the model predicted in a binary target
S2: build model with combo of folds 1+3+4+5, validate on fold 2. Only occurs if cross validation is deemed appropriate and valuable for the given model. Steps 2-5 are conducted as a single cross validation process. Validation sample (fold 2) is now hidden from the algorithm. Once the algorithm creates a model from 1+3+4+5, it is applied to the validation sample, predicts the state of the target variable for each row, and then an accuracy score is calculated
S3 (four in the book): validation follows same actions as in step 2. Validation sample moved down to fold 3. Training conducted with folds 1+2+4+5, with accuracy of .25 (1/4 rows correctly assigned)
S4 (five in the book): validation sample moved down to fold 4. Training conducted with combined folds 1+2+3+5. Validation run against fold 4 with accuracy of .75 (3/4 rows correctly assigned)
S5 (six in the book): validation sample moved down to fold 5. Training conducted with folds 1+2+3+4, with accuracy of .25 (1/4 rows correctly assigned)
S6 (seven in the book?): overall accuracy for cross validation calculated (every non-holdout row is used 5x: 4x to construct a model and 1x to evaluate another model). Model accuracy calculated by checking the true target value for all 20 rows against their predicted values.
ex. number of correct predictions was 3+2+1+3+1=10 out of 20, an accuracy of 50 percent.
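The fold rotation and the accuracy arithmetic above can be sketched as follows. The 20 rows stand for the non-holdout rows, and the per-fold correct counts (3, 2, 1, 3, 1) are the ones from the example. The function names and the seed are hypothetical, and no real model is trained here.

```python
import random


def make_folds(row_ids, k=5, seed=0):
    """Step 0: shuffle the non-holdout rows and deal them into k folds."""
    ids = list(row_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def cross_validation_plan(folds):
    """Yield (training rows, validation rows), with each fold validating once."""
    for i, validation in enumerate(folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, validation


folds = make_folds(range(20))            # 5 folds of 4 rows each
plan = list(cross_validation_plan(folds))

# Correct predictions per validation fold, taken from the example in the notes.
correct_per_fold = [3, 2, 1, 3, 1]
fold_accuracy = [c / len(f) for c, f in zip(correct_per_fold, folds)]
overall_accuracy = sum(correct_per_fold) / sum(len(f) for f in folds)
```

Each row appears in four training sets and one validation set, which is the "used 5x" count in S6. The overall accuracy is 10/20 = 0.5, matching the example.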
section IV: Model Data
DataRobot is a living tool that takes care of the requirements needed to conduct high-quality data science
Until now, columns in our dataset = columns. Will switch to the language of machine learning and refer to columns as features. The column containing our target = the target
Machine learning pipeline. In the past: real-world data extracted, features with target created and submitted to the Model Data stage. In the present: data is submitted for a new patient once treated, and their electronic health record is updated and put into the database. Production system then applies the same process (but w/o target), and the result is submitted to the model in production to create the target; a prediction and recommendation come out as a result