Larsen Chapters 13-15
Ch14: Feature Understanding and Selection
14.1: Descriptive Statistics
features are listed in rows, with their names under the Feature Name header and the order in which they were read in under the Index header
information on standard descriptive statistics such as Mean, Std Dev (standard deviation), Median, Min, and Max
the range includes the number next to a bracket, but not the number next to a parenthesis; for example, [20, 30) includes 20 but excludes 30
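A minimal sketch of the statistics and the bracket/parenthesis range convention listed above, using only the Python standard library. The function names `describe` and `in_half_open` and the sample `ages` data are hypothetical, and the use of the sample (n-1) standard deviation is an assumption; the notes do not say which variant DataRobot reports.

```python
import statistics

def describe(values):
    """Return the descriptive statistics named above for one numeric feature."""
    return {
        "Mean": statistics.mean(values),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "Std Dev": statistics.stdev(values),
        "Median": statistics.median(values),
        "Min": min(values),
        "Max": max(values),
    }

def in_half_open(x, low, high):
    """True when x lies in [low, high): the bracket side is included,
    the parenthesis side is not."""
    return low <= x < high

# Hypothetical feature values, for illustration only.
ages = [20, 30, 40, 50, 60]
summary = describe(ages)

includes_low = in_half_open(20, 20, 30)   # 20 sits next to the bracket
includes_high = in_half_open(30, 20, 30)  # 30 sits next to the parenthesis
```

Here `includes_low` is true and `includes_high` is false, which matches the reading of a range such as [20, 30).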
14.2: Data Types
indicates the nature of the data inside of a feature
categorical
numerical, including integers and decimals
14.3: Evaluations of Feature Content
[Reference ID] means that the system will not use it to predict your target
[Duplicate] marks a feature whose exact values also appear in other features
[Too many values] applies to non-numeric columns
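The three tags above can be illustrated with a rough sketch. The detection rules here are assumptions made for illustration, not DataRobot's actual logic: a column of all-unique integers is treated as a reference ID, a column identical to an earlier one as a duplicate, and a non-numeric column whose share of unique values exceeds a hypothetical `max_unique_ratio` as having too many values.

```python
def evaluate_features(columns, max_unique_ratio=0.5):
    """Map each feature name to a list of illustrative tags.

    columns: dict of feature name -> list of values (same length each).
    max_unique_ratio: hypothetical threshold, not taken from DataRobot.
    """
    flags = {name: [] for name in columns}
    seen = set()
    for name, values in columns.items():
        key = tuple(values)
        if key in seen:
            # Same values as a feature that was read in earlier.
            flags[name].append("Duplicate")
            continue
        seen.add(key)

        n_unique = len(set(values))
        all_ints = all(isinstance(v, int) for v in values)
        numeric = all(isinstance(v, (int, float)) for v in values)

        if all_ints and n_unique == len(values):
            # Assumed heuristic: every row has its own integer value.
            flags[name].append("Reference ID")
        elif not numeric and n_unique / len(values) > max_unique_ratio:
            flags[name].append("Too many values")
    return flags

# Hypothetical dataset, for illustration only.
columns = {
    "row_id": [1, 2, 3, 4],
    "age": [30, 41, 30, 52],
    "age_copy": [30, 41, 30, 52],
    "comment": ["a", "b", "c", "d"],
    "color": ["red", "blue", "red", "blue"],
}
flags = evaluate_features(columns)
```

In this toy data, `row_id` is tagged as a reference ID, `age_copy` as a duplicate, and `comment` as having too many values, while `age` and `color` carry no tag.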
14.4: Missing Values
missing from a specific feature
?
by converting them all to one consistent missing-value type, an analyst keeps them from being treated as multiple unique values during an analysis
nulls are values that were never entered or retrieved
algorithms that struggle with missing values include regression, neural networks, and support vector machines
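The point about converting missing values to one consistent type can be sketched as follows. The token list in `MISSING_TOKENS` is an assumption chosen for illustration (the notes mention only "?" and nulls), and `normalize_missing` is a hypothetical helper name.

```python
# Assumed spellings of "missing"; extend to match the dataset at hand.
MISSING_TOKENS = {"", "?", "na", "n/a", "nan", "null", "none"}

def normalize_missing(values):
    """Replace every recognized missing-value marker with None."""
    cleaned = []
    for v in values:
        is_token = isinstance(v, str) and v.strip().lower() in MISSING_TOKENS
        cleaned.append(None if v is None or is_token else v)
    return cleaned

# Hypothetical raw column with several spellings of "missing".
raw = ["12", "?", "NA", "", "7", "null", None]
cleaned = normalize_missing(raw)

unique_before = len(set(raw))      # each marker counts as its own value
unique_after = len(set(cleaned))   # all markers collapse into one
```

Before normalization the column appears to hold seven distinct values; afterwards it holds three: "12", "7", and a single missing marker.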
Ch15: Build Candidate Models
15.3 Starting the Analytical Process
Quick Run is a version of Autopilot that produces models that are almost as good by shortcutting the DataRobot best-practice ML process
Autopilot and Quick are identical except that for Autopilot DataRobot starts the analysis at 16% of the sample
Quick starts at 32% with models that have historically performed well
15.2 Advanced Options
Random is identical in its approach to what was discussed in Chapter 12.3: 20% of the data is assigned to the holdout sample
the Stratified option works a bit harder to maintain the same distribution of target values inside the holdout as the other samples
the number of folds can also be set manually
the Partition Feature is a method for determining exactly which cases are used in different folds
the Group approach accomplishes much the same as the Partition Feature, with some differences
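The Random and Stratified options and the fold count described above can be sketched in plain Python. This is an illustration of the general idea, not DataRobot's implementation: the function names, the seed, and the default of 5 folds are assumptions; only the 20% holdout comes from the notes.

```python
import random
from collections import defaultdict

def random_holdout(n_rows, holdout_frac=0.2, seed=0):
    """Random: shuffle row indices and set aside holdout_frac of them."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    k = int(round(n_rows * holdout_frac))
    return idx[k:], idx[:k]  # (training rows, holdout rows)

def stratified_holdout(targets, holdout_frac=0.2, seed=0):
    """Stratified: take holdout_frac of each target class separately,
    so the holdout keeps the same target distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, t in enumerate(targets):
        by_class[t].append(i)
    train, holdout = [], []
    for rows in by_class.values():
        rng.shuffle(rows)
        k = int(round(len(rows) * holdout_frac))
        holdout.extend(rows[:k])
        train.extend(rows[k:])
    return train, holdout

def assign_folds(rows, n_folds=5):
    """Assign the remaining rows to n_folds folds in round-robin order.
    n_folds=5 is an assumed default that can be set manually."""
    return {row: i % n_folds for i, row in enumerate(rows)}

# Hypothetical imbalanced target: 20 positives, 80 negatives.
targets = [1] * 20 + [0] * 80
rand_train, rand_hold = random_holdout(len(targets))
train, holdout = stratified_holdout(targets)
positives_in_holdout = sum(targets[i] for i in holdout)
folds = assign_folds(train, n_folds=5)
```

With the stratified split, the 20-row holdout contains exactly 4 positives, the same 20% share as the full data; the random split gives no such guarantee on small or imbalanced samples.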
15.1: Starting the Process
LogLoss (Accuracy): instead of counting only right and wrong classifications, the model is evaluated on the probabilities it generates and their distance from the correct answer
DataRobot: LogLoss is in general a good measure
time to select the target feature
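To make the LogLoss idea above concrete, here is a minimal sketch of the standard binary log loss formula. The function name `log_loss` and the clipping constant `eps` are choices made for this example; the notes do not describe how DataRobot computes the metric internally.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary log loss.

    y_true: actual labels (0 or 1).
    y_prob: predicted probability that the label is 1.
    eps: clipping bound that keeps log() away from 0 (an assumption).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Same two cases, scored with probabilities near and far from the truth.
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
coin_flip = log_loss([1, 0], [0.5, 0.5])
```

Lower is better: probabilities close to the correct answer give a small loss, a 50/50 guess gives ln 2, and confident wrong answers are penalized most heavily.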
Chapter 13: Startup Processes
DataRobot
13.1: Uploading Data
easiest way to bring a dataset in is to read in a local file
DataRobot will accept .csv, .tsv, and Excel files
data can also be read in when it is stored in a database such as PostgreSQL, Oracle, or MySQL
assumed that feature names are unique
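Because feature names are assumed to be unique, it can help to check a file's header before uploading it. The sketch below does this for delimited text with the standard `csv` module; `duplicate_feature_names` and the sample data are hypothetical, and this check is a local convenience, not part of DataRobot.

```python
import csv
import io
from collections import Counter

def duplicate_feature_names(text, delimiter=","):
    """Return the header names that appear more than once.

    text: contents of a delimited file (.csv uses ",", .tsv uses a tab).
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    counts = Counter(name.strip() for name in header)
    return [name for name, n in counts.items() if n > 1]

# Hypothetical file contents with "age" listed twice.
sample_csv = "id,age,income,age\n1,30,50000,31\n"
dupes = duplicate_feature_names(sample_csv)

# A clean tab-separated header yields an empty list.
sample_tsv = "id\tage\tincome\n1\t30\t50000\n"
tsv_dupes = duplicate_feature_names(sample_tsv, delimiter="\t")
```

For a file on disk, the same function can be fed the result of reading the file as text; any name it returns should be renamed before upload.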
Steps
Apply the tag to the project by clicking Apply
while the data is being processed click on Untitled Project and name the project
click the Data link in the upper left corner to then return to the data screen
the link turns blue to illustrate that this is the screen currently being displayed