Ch 13-15 AutoML & DataRobot
TIPS
Analyzing ten small datasets rather than one large dataset is much more valuable for learning the content in this book
Name your projects and create tags for multiple projects to better identify them
In histogram bins, a bracket denotes an inclusive endpoint whereas a parenthesis denotes an exclusive endpoint
DataRobot shows only the first 50 features; you can view them all by scrolling to the bottom
DataRobot tags features that have too few values, or that duplicate other features, so that they are left out of its analysis
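The bracket and parenthesis notation for histogram bins can be sketched in a few lines of Python. This is only an illustration of the notation: the function name `in_bin` and the example bin [10, 20) are hypothetical and are not taken from DataRobot.

```python
def in_bin(value, low, high, include_low=True, include_high=False):
    """Return True if value falls inside the bin.

    A bracket means the endpoint is included (inclusive);
    a parenthesis means the endpoint is excluded (exclusive).
    The defaults describe a bin written as [low, high).
    """
    above = value >= low if include_low else value > low
    below = value <= high if include_high else value < high
    return above and below


# Bin written as [10, 20): 10 belongs to the bin, 20 does not.
print(in_bin(10, 10, 20))  # True
print(in_bin(20, 10, 20))  # False
```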
Data formats
data is stored in a database such as PostgreSQL, Oracle, or MySQL (ODBC)
data is in a Hadoop system and you have an Enterprise DataRobot account (HDFS)
data is available in any of the above formats, is on the web, and has a direct URL (web link) linking to it (URL)
Compressing data with gzip, bzip2, or zip can speed up the data upload
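As a minimal sketch of the compression tip, a file can be gzip-compressed with the Python standard library before uploading. The helper name `gzip_file` is hypothetical, and the upload step itself is not shown.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path=None):
    """Compress src_path with gzip and return the path of the compressed file.

    If dst_path is not given, ".gz" is appended to the source path.
    """
    dst_path = dst_path or src_path + ".gz"
    # Stream the bytes so that large files are never loaded fully into memory.
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

For example, `gzip_file("training_data.csv")` (a hypothetical file name) would write `training_data.csv.gz` next to the original file.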
Feature Understanding and Selection
it is always important to scrutinize the automatic coding done by a machine learning system so that categorical features stored as numbers are coded as categorical rather than numeric
Common data types
Categorical
Numeric
Boolean
Text
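One way to scrutinize automatic coding is to flag numeric columns that may really be categorical. The sketch below uses a simple heuristic that is purely an assumption for illustration (whole-number values with few distinct levels); it is not how DataRobot decides feature types, and the name `looks_categorical` and the threshold `max_distinct` are hypothetical.

```python
def looks_categorical(values, max_distinct=10):
    """Flag a numeric column that may really be categorical.

    Heuristic (an assumption, not DataRobot's rule): every value is a
    whole number and there are at most max_distinct different values.
    """
    distinct = set(values)
    all_whole = all(float(v).is_integer() for v in distinct)
    return all_whole and len(distinct) <= max_distinct


# Numeric codes for a category (e.g. 1 = low, 2 = medium, 3 = high).
print(looks_categorical([1, 2, 3, 1, 2, 3, 3]))   # True
# A genuinely continuous measurement.
print(looks_categorical([1.5, 2.7, 3.14, 0.2]))   # False
```

A flagged column still needs a human check: the heuristic only points at features worth a second look.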
Build Candidate Models
LogLoss (Accuracy) means that, rather than evaluating the model directly on whether it assigns cases (rows) to the correct “label” (False or True), the model is evaluated on the probabilities it generates and their distance from the correct answer
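The idea behind LogLoss can be sketched with the standard binary log loss formula. This is a minimal illustration, not DataRobot's implementation; the two sets of example probabilities are made up.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss.

    y_true: actual labels coded as 1 (True) or 0 (False).
    y_prob: predicted probabilities that the label is 1.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1]
confident = [0.9, 0.1, 0.8]    # probabilities close to the correct answers
hesitant = [0.6, 0.4, 0.55]    # same labels at a 0.5 cutoff, but less certain

# Both sets of predictions pick the correct label for every row,
# yet the confident one is closer to the truth and gets a lower LogLoss.
print(log_loss(y_true, confident) < log_loss(y_true, hesitant))  # True
```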
The only difference between Random and Stratified is that the Stratified option works a bit harder to maintain the same distribution of target values inside the holdout as in the other samples
Partition Feature: user determines exactly which cases are used in different folds
Group: allows for the specification of a group membership feature. DataRobot makes decisions about where a case is to be partitioned but always keeps each group (those with the same value in the selected feature) together in only one partition.
Date/Time: ensures that all validation cases occur in a time period after that of the cases used to create the models
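The partitioning options above can be sketched in plain Python. These are simplified illustrations under stated assumptions, not DataRobot's actual procedures: the function names, the 20% holdout fraction, and the round-robin assignment of groups to partitions are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, holdout_frac=0.2, seed=0):
    """Pick holdout rows so each target value keeps its overall share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    holdout = []
    for indices in by_label.values():
        rng.shuffle(indices)
        k = round(len(indices) * holdout_frac)
        holdout.extend(indices[:k])
    return sorted(holdout)


def group_partition(groups, n_partitions=5):
    """Assign rows to partitions so a group never spans two partitions."""
    partition_of_group = {}
    assignment = []
    for g in groups:
        if g not in partition_of_group:
            # Round-robin over groups in order of first appearance.
            partition_of_group[g] = len(partition_of_group) % n_partitions
        assignment.append(partition_of_group[g])
    return assignment


def time_split(timestamps, cutoff):
    """Train on rows up to the cutoff; validate only on later rows."""
    train = [i for i, t in enumerate(timestamps) if t <= cutoff]
    valid = [i for i, t in enumerate(timestamps) if t > cutoff]
    return train, valid


labels = [True] * 20 + [False] * 80          # 20% True overall
holdout = stratified_holdout(labels)
print(sum(labels[i] for i in holdout), len(holdout))   # 4 20

print(group_partition(["a", "b", "a", "c", "b"], n_partitions=2))  # [0, 1, 0, 0, 1]
print(time_split([1, 2, 3, 4, 5], cutoff=3))  # ([0, 1, 2], [3, 4])
```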