Chapter 14: Feature Understanding and Selection

Descriptive Statistics

Var Type

Unique

Missing

Mean

Std. Deviation

Minimum

Maximum

*Any feature can be clicked for more details
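To make the summary columns above concrete, here is a minimal sketch that computes the same kinds of statistics for one numeric feature using only the Python standard library. The function name `describe` is hypothetical. The choices to mark missing values with None, to exclude them from the unique count, and to use the sample standard deviation are assumptions; the notes do not say how DataRobot computes these.

```python
import statistics

def describe(values):
    """Summarize one numeric feature; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    return {
        "unique": len(set(present)),             # distinct non-missing values
        "missing": len(values) - len(present),   # count of None entries
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),        # sample standard deviation (assumption)
        "min": min(present),
        "max": max(present),
    }

summary = describe([2, 4, 4, None, 6])
print(summary)
```

With this input the feature has 3 unique values, 1 missing value, a mean of 4, a minimum of 2, and a maximum of 6.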

*Use histograms

Notice Cutoffs between each value

()

[]

Exclusive range: doesn't include the number next to (

Inclusive range: includes the number next to [

DataRobot uses [1,3) --> includes 1 and 2, excludes 3
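The half-open convention can be sketched in a few lines of Python. This is an illustration of the [low, high) rule only, not DataRobot's histogram code; the function names and the bin edges are hypothetical.

```python
def in_half_open(value, low, high):
    """True when value lies in [low, high): low is included, high is excluded."""
    return low <= value < high

def bin_counts(values, edges):
    """Count values per half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if in_half_open(v, edges[i], edges[i + 1]):
                counts[i] += 1
                break  # each value falls in at most one bin
    return counts

# Bins are [1,3), [3,5), [5,7)
counts = bin_counts([1, 2, 3, 4, 5], [1, 3, 5, 7])
print(counts)
```

Because the upper edge is excluded, the value 3 is counted in [3,5) rather than [1,3), so no value is counted twice at a cutoff.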

Data Types

Categorical

Numerical

Boolean: True or False
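As a rough illustration of the three types above, the sketch below guesses a type for a column of raw strings. The heuristic and the name `infer_var_type` are assumptions made for this example; the notes do not describe how DataRobot actually detects types.

```python
def infer_var_type(values):
    """Guess a coarse variable type for a column of raw strings (hypothetical heuristic)."""
    # Ignore empty or None entries when guessing the type
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "Unknown"
    # Only the literals true/false (any casing) count as Boolean here
    if {v.lower() for v in present} <= {"true", "false"}:
        return "Boolean"
    try:
        for v in present:
            float(v)  # raises ValueError for non-numeric text
        return "Numerical"
    except ValueError:
        return "Categorical"

print(infer_var_type(["True", "false", "TRUE"]))
print(infer_var_type(["1", "2.5", "-3"]))
print(infer_var_type(["red", "blue", "7"]))
```

A single non-numeric entry is enough to make the whole column Categorical under this rule, which is one reason inconsistent codes in a numeric column matter.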

Carefully organized data that is standardized

Improve pattern detection

Be able to view features beyond index 50

Convert Measurements to standardized numbers

DataRobot checks whether an auto-generated feature already exists and, if it does, does not generate a duplicate

Evaluations of Feature Content

Missing Values

DataRobot ignores features with min. unique values

Missing Column

? is coded as a missing value by DataRobot

Shows how many values are missing from specific feature

Convert them all to one consistent type:

Analyst can avoid missing codes being categorized as unique values

avoids treating missing value as text

Algorithms that struggle with missing values

regression

neural networks

support vector machines

Nulls

Values that DataRobot recognizes as missing are converted to nulls

Other Codes for Missing

N/A, na, n/a, #N/A

inf, Inf, INF

Empty fields
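Converting every missing code to one consistent marker can be sketched as below. The set of codes is taken only from the ones listed in these notes and may not match DataRobot's full list; using None as the single marker and the name `normalize_missing` are assumptions for illustration.

```python
# Missing codes named in these notes (not necessarily DataRobot's complete list)
MISSING_CODES = {"?", "N/A", "na", "n/a", "#N/A", "inf", "Inf", "INF", ""}

def normalize_missing(values):
    """Replace every recognized missing code with None; leave other values as-is."""
    return [
        None if v is None or v.strip() in MISSING_CODES else v
        for v in values
    ]

raw = ["12", "?", "N/A", "", "7", "inf"]
clean = normalize_missing(raw)
missing_count = sum(v is None for v in clean)
print(clean, missing_count)
```

After normalization the column holds only the values "12" and "7" plus four None entries, so the mixed codes no longer show up as separate text values.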

DETECT ERRORS EARLY ON