CHAPTERS 13-15
CHPT 14: FEATURE UNDERSTANDING & SELECTION
feature column
presents a bar chart initially sorted by size (# of rows)
shows descriptive statistics - mean, median, mode
each bar column has a range of bins
a square bracket [ ] marks an inclusive bound, whereas a parenthesis ( ) marks an exclusive bound
range includes the number next to a bracket but not the number next to a parenthesis
binary categorical - 2 categories
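The bracket and parenthesis rule above can be sketched as a small membership check. This is only an illustration: the function name in_bin and its parameters are hypothetical and are not part of DataRobot.

```python
def in_bin(value, low, high, low_inclusive=True, high_inclusive=False):
    """Return True if value falls inside the bin.

    A bracket corresponds to an inclusive bound (>= or <=),
    a parenthesis to an exclusive bound (> or <).
    """
    above = value >= low if low_inclusive else value > low
    below = value <= high if high_inclusive else value < high
    return above and below


# The bin [10, 20) includes 10 but excludes 20.
print(in_bin(10, 10, 20))
print(in_bin(20, 10, 20))
```

With the defaults the check models a bin written as [low, high); passing high_inclusive=True models [low, high].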
multi-class categorical - many categories
numeric data - any type of number - integers and decimals
boolean - categorical that holds one of two values - true or false
text data - less useful than numeric and categorical
RowID - unique identifier and marked with [reference ID]
missing column - outlines how many values are missing from a specific feature (?)
CHAPTER 15: BUILD CANDIDATE MODELS
LogLoss - the model is evaluated on the probabilities it generates and how far they fall from the correct answer
false = 0
true = 1
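A minimal sketch of the standard binary log loss formula, with false coded as 0 and true as 1 as in the notes above. The function name and the eps clipping value are assumptions made for this illustration, not DataRobot's implementation.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary log loss; y_true holds 0/1, y_prob holds P(true)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so log(0) never occurs.
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Probabilities closer to the correct answer give a lower (better) score.
print(log_loss([1, 0], [0.9, 0.1]))
print(log_loss([1, 0], [0.6, 0.4]))
```

A model that always predicts 0.5 scores ln(2), about 0.693, so useful models should land below that.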
random - pulls out a random % of sample
20%
split into N folds
small sets - more folds
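The random partitioning described above can be sketched as follows, assuming the 20% is a holdout sample set aside before the remaining cases are split into N folds. The function name, the default of 5 folds, and the seed are assumptions for illustration only.

```python
import random


def partition(n_cases, holdout_frac=0.2, n_folds=5, seed=0):
    """Shuffle case indices, set aside a holdout, split the rest into folds."""
    rng = random.Random(seed)
    indices = list(range(n_cases))
    rng.shuffle(indices)

    n_holdout = int(n_cases * holdout_frac)
    holdout = indices[:n_holdout]
    rest = indices[n_holdout:]

    # Deal the remaining cases round-robin so fold sizes differ by at most one.
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return holdout, folds


holdout, folds = partition(100)
print(len(holdout), [len(f) for f in folds])
```

Raising n_folds for a small dataset leaves more cases for training in each round, which is the reasoning behind "small sets - more folds".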
Partition Feature - which cases are used in different folds
own random assignment of cases
Group - allows for specification of a group membership feature
decides where each case is partitioned but always keeps each group together in one partition
Validation set is small (<= 10,000 cases) - run the full cross-validation process on the eight top models
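Group partitioning can be sketched as below. The greedy "largest group into the smallest partition" rule is an assumption chosen for this illustration; the notes do not say how DataRobot balances partitions, only that a group is never split.

```python
from collections import defaultdict


def group_partition(group_ids, n_partitions):
    """Assign case indices to partitions without ever splitting a group."""
    groups = defaultdict(list)
    for idx, group in enumerate(group_ids):
        groups[group].append(idx)

    partitions = [[] for _ in range(n_partitions)]
    # Place the largest groups first, each into the currently smallest partition.
    for group in sorted(groups, key=lambda g: -len(groups[g])):
        smallest = min(partitions, key=len)
        smallest.extend(groups[group])
    return partitions


# Hypothetical group membership feature, one value per case.
ids = ["a", "a", "b", "c", "c", "c", "d"]
for part in group_partition(ids, 2):
    print(sorted({ids[i] for i in part}), part)
```

Keeping a group in one partition prevents closely related cases from appearing on both the training and the validation side.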
blending - models are internally sorted by cross-validation score before the best models are blended
32% sample - 19 algorithms are configured and placed in the queue to work with 32% of the data
64% sample - the eight best algorithms from the 32% round are selected and run on 64% of the data
CHPT 13: STARTUP PROCESSES
bring dataset into DataRobot - read in a
local file
strongly recommended to stick with smaller downsampled datasets
accepts comma-separated value (CSV) files - anything before the first comma on a line belongs to column 1, and so on
feature names are unique
joining different tables will invariably create columns with identical names
rename or remove features with identical names
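A short sketch for spotting identical feature names before upload, for example after joining tables. It reads only the header row of a CSV; the function name is hypothetical and the check runs locally, independent of DataRobot.

```python
import csv
import io
from collections import Counter


def duplicate_feature_names(csv_text):
    """Return the header names that appear more than once."""
    header = next(csv.reader(io.StringIO(csv_text)))
    counts = Counter(name.strip() for name in header)
    return [name for name, n in counts.items() if n > 1]


# A join of two tables that both carried an "id" column.
sample = "id,name,id,age\n1,a,1,30\n"
print(duplicate_feature_names(sample))
```

Any name returned here should be renamed or removed so that every feature name is unique.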
uploaded dataset must be 100 data rows or more, less than or equal to 20,000 columns, and less than or equal to 1.5 gigabytes of an uncompressed text file
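The size limits above can be checked locally before uploading. This is a sketch under stated assumptions: the constant and function names are hypothetical, and 1.5 gigabytes is interpreted here as 1.5 * 1024^3 bytes, which the notes do not specify.

```python
MIN_ROWS = 100
MAX_COLUMNS = 20_000
MAX_BYTES = int(1.5 * 1024**3)  # assumption: binary gigabytes


def upload_problems(n_rows, n_columns, n_bytes):
    """Return a list of limit violations; an empty list means none were found."""
    problems = []
    if n_rows < MIN_ROWS:
        problems.append(f"too few data rows: {n_rows} < {MIN_ROWS}")
    if n_columns > MAX_COLUMNS:
        problems.append(f"too many columns: {n_columns} > {MAX_COLUMNS}")
    if n_bytes > MAX_BYTES:
        problems.append(f"file too large: {n_bytes} > {MAX_BYTES} bytes")
    return problems


print(upload_problems(n_rows=50, n_columns=12, n_bytes=4_096))
```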
add tags to each project
general name for the type of project that can be applied to similar ones
once data is uploaded - it will "rapidly upload the data to the DataRobot cloud platform (step 1), read the data out of the file (step 2), and prepare exploratory data analysis results (step 3)"