Chapters 13, 14, & 15
Chapter 13- Startup Processes
uploading data
opening page is often the most recent file worked on
easiest way to bring a dataset into DR is to read in a local file; this can be a file carefully prepared through the approaches outlined in Section III or Appendix A
generally several datasets are supplied
stick to downsampled datasets while learning; larger datasets carry serious risks
While data is being processed, click on "Untitled Project"
name the project; tags can be created to track projects. To do this, click the file folder inside the circle symbol to bring up a new menu where you can create and manage projects -> click Manage Projects -> lists all prior and currently running projects -> click Tags and type in a general name for the type of project, which can then be applied to similar projects
tags become more important with more projects completed
DR will accept comma-separated values files (.csv)
these are files where the first row contains the column names with a comma between each; following rows contain the data, with a comma between each value
on the left side of a .csv file, anything listed before the first comma on a line belongs in column 1
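a minimal sketch of such a file, read with pandas (the column names and values here are invented for illustration, not from the book):

```python
import pandas as pd
from io import StringIO

# First row: column names separated by commas; later rows: data values
csv_text = """loan_id,amount,term
1,5000,36
2,2400,60
3,10000,36"""

# Everything before the first comma on each line lands in column 1 (loan_id)
df = pd.read_csv(StringIO(csv_text))
print(df)
```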
if the data is online, use the "URL" button; if the data is from a database, use "ODBC"; if it's from Hadoop, use "HDFS"
compressing data as .gzip, .bzip2, .zip, .tar, .gz, .tgz, or .tar.bz2 speeds up data upload
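a minimal sketch of compressing a file to .gz before upload, using Python's standard library (the filename is a placeholder):

```python
import gzip
import shutil

# Compress train.csv (placeholder name) to train.csv.gz; smaller files upload faster
with open("train.csv", "rb") as src, gzip.open("train.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```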
Chapter 14- Feature Understanding and Selection
after setting up data, interpret the data contained in each feature (feature understanding)
descriptive stats
the index number is used to specify which feature is being discussed
Unique is listed after Var Type and notes how many unique values exist for each specific feature
any feature name can be clicked to show more details
to the right of the Unique column: info on standard descriptive stats like mean, SD, median, min, max (only available for numeric columns)
[] denotes an inclusive range while () denotes an exclusive range (the range includes the number next to a bracket but not the one next to a parenthesis)
68-95-99.7 rule
68% of data is within -1 to 1 SD of the mean, 95% of data within -2 to 2 SDs, and 99.7% within -3 to 3 SDs
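a quick numpy check of the rule on simulated normal data (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # simulated normally distributed feature

for k in (1, 2, 3):
    frac = np.mean(np.abs(x - x.mean()) <= k * x.std())
    print(f"within +/-{k} SD: {frac:.3f}")  # ~0.683, ~0.954, ~0.997
```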
Data types
data type indicates the nature of the data inside a feature
binary categorical (two categories); multi-class categorical (many categories)
the other most common type is numeric
any type of number including integers and decimals
boolean
two values, true or false
text type
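a rough pandas analogue of how these types show up in a data frame (toy values, not DR's actual type inference):

```python
import pandas as pd

df = pd.DataFrame({
    "approved": [True, False, True],        # boolean
    "grade": ["A", "B", "A"],               # binary categorical
    "state": ["CA", "NY", "TX"],            # multi-class categorical
    "amount": [5000.0, 2400.0, 10000.0],    # numeric
    "notes": ["paid early", "late", "ok"],  # text
})

print(df.dtypes)     # rough analogue of the Var Type column
print(df.nunique())  # rough analogue of the Unique column
```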
DR does examine whether any auto-generated features already exist in the dataset and, in such cases, doesn't generate a new feature
Missing Values
the Missing column outlines how many values are missing from a specific feature
? = missing value in DR
why care about these?
many ways for missing values to be handled; one option is to convert all of them to one consistent type; even if a missing value is treated correctly, some algorithms will ignore a whole row if there is a single missing value in any cell of that row, leading to deleterious effects for the model
algorithms that struggle with missing values: regression, neural networks, support vector machines
missing values can also appear as nulls
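a small pandas sketch of the trade-off: counting missing values, the row-dropping behavior some algorithms effectively apply, and one consistent imputation (column names invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 51], "income": [72000, 58000, None]})

print(df.isna().sum())         # missing count per feature (the Missing column idea)
print(df.dropna())             # one NaN anywhere in a row discards the whole row
print(df.fillna(df.median()))  # one consistent treatment: impute with the median
```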
Chapter 15- Build Candidate Models
data is now ready to be used in creating the first deep-learning, neural network, and other models
these models serve to improve understanding of which combinations of data, preprocessing, parameters, and algorithms work well when constructing models
Starting the process
Select target feature
once the selection is made, the top of the window changes
logloss (accuracy)
means that rather than evaluating the model directly on whether it assigns cases (rows) to the correct "label" (T or F), the model is evaluated based on the probabilities it generates and their distance from the correct answer
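a minimal example with scikit-learn's log_loss (the labels and probabilities are made up):

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]          # correct labels (T/F as 1/0)
y_prob = [0.9, 0.2, 0.6, 0.8]  # model-generated probabilities of the positive class

# Confident wrong probabilities are penalized far more than near misses
print(log_loss(y_true, y_prob))
```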
there are advanced options
starting the analytical process
DR prepares data through Autopilot, Quick, and Manual modes
before this, reset the options to where they originally were
quick run
abbreviated version of Autopilot that produces almost-as-good models by shortcutting DR's best-practice machine learning process
informative features
represents all data features except for the ones automatically excluded and tagged
step 4- loading dataset and preparing data
relevant if the data is large (over 500 MB); all initial evaluations before this step will have been conducted with a 500 MB sample of the dataset, and now the rest of the dataset is loaded
step 5- saving the target and partitioning the info
step 6- importance scores calculated
step 7- calculating a list of models, where info from steps 3-6 is used to determine which blueprints to run in Autopilot
model selection process
RMSE
root mean square error
frequently used measure of the difference between values predicted by a model or estimator and the values actually observed
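a minimal sketch of the calculation in numpy (illustrative values only):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean squared prediction error."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

print(rmse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))
```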