Chapter 10: Data Transformations
Creating new features from data
Text
Extract new columns from columns w/ text
Categorical
Combining many categories into fewer
Multi-categorical into binary columns
Numerical
Apply +, -, *, etc. to two or more columns to create new columns
Helps ML better predict target
Splitting and Extracting New Columns
IF-THEN statements
Examine a value in a column and make changes to it or to other values in the dataset
Lets you create content in a new column depending on what exists in one or more other columns
EmployeeID example
If EmployeeID = 1, THEN Emp_1 = 1, ELSE Emp_1 = 0, ENDIF
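The IF-THEN rule above can be sketched in pandas with `numpy.where`; the sample EmployeeID values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: an EmployeeID column as in the example above
df = pd.DataFrame({"EmployeeID": [1, 2, 3, 1]})

# IF EmployeeID = 1 THEN Emp_1 = 1 ELSE Emp_1 = 0
df["Emp_1"] = np.where(df["EmployeeID"] == 1, 1, 0)
```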
One-hot encoding
Conversion of a categorical column containing two or more possible values into discrete columns representing each value
For each unique category in column, a new column is created with binary values of 1 or 0 (TRUE or FALSE)
"Dummy" encoding in statistics
Improves predictive ability
Likelihood of successful sale example (EmployeeID)
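One-hot encoding can be sketched with pandas `get_dummies`; the EmployeeID values are hypothetical, and `dtype=int` keeps the output as 1/0 rather than True/False:

```python
import pandas as pd

# Hypothetical EmployeeID column; one new binary column per unique value
df = pd.DataFrame({"EmployeeID": [1, 2, 3]})
dummies = pd.get_dummies(df["EmployeeID"], prefix="Emp", dtype=int)
```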
Transformations
Various transformations to create new features
Comparison of two columns can create a new column that provides additional predictive ability for ML algorithms
Addition (+)
Adding different columns can increase predictive signals
Ex: Adding family size to understand behavior in certain situations like air travel
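A minimal sketch of the family-size example, assuming hypothetical `adults` and `children` columns:

```python
import pandas as pd

# Hypothetical columns: adults and children per household
df = pd.DataFrame({"adults": [2, 1], "children": [3, 0]})
df["family_size"] = df["adults"] + df["children"]
```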
Subtraction (-)
Subtracting one column from another to make the similarity or difference between them more apparent
Closer to 0 = the more similar
Ex: Predicting whether a person is likely to engage in outside activities
Absolute (Abs())
Similar to subtraction, but used in cases where the actual distance between two numbers is of importance
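Subtraction and absolute difference side by side, using hypothetical hours columns for the outside-activities idea:

```python
import pandas as pd

# Hypothetical columns: weekly work hours vs. free hours
df = pd.DataFrame({"work_hours": [40, 55], "free_hours": [45, 20]})
df["hours_diff"] = df["work_hours"] - df["free_hours"]         # sign shows direction
df["hours_gap"] = (df["work_hours"] - df["free_hours"]).abs()  # distance only
```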
Multiplication (*)
When two related columns interact with a target
Interaction effect is often called a moderated relationship between a column and the target
Ex: bad interaction with customer service vs. bad tempered customer & bad interaction in relation to churn
Churn = cancel customer relationship
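The churn interaction can be sketched by multiplying two hypothetical binary flags, so the new feature is 1 only when both conditions hold:

```python
import pandas as pd

# Hypothetical flags from the churn example
df = pd.DataFrame({"bad_service": [1, 0, 1], "bad_tempered": [1, 1, 0]})
# Interaction term: 1 only when both flags are 1
df["service_x_temper"] = df["bad_service"] * df["bad_tempered"]
```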
Division (/)
Makes otherwise hidden information available
Ex: Dividing income by number of kids might reveal available funds
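The income-per-kid example as a sketch; the column names are hypothetical, and treating 0 kids as 1 to avoid division by zero is an assumption:

```python
import pandas as pd

# Hypothetical columns from the income/kids example
df = pd.DataFrame({"income": [90000, 60000], "num_kids": [3, 0]})
# Guard against division by zero by treating 0 kids as 1 (an assumption)
df["income_per_kid"] = df["income"] / df["num_kids"].replace(0, 1)
```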
Less than (<)
If # of seats in car is smaller than family size, might be predictive of purchasing a new car
Less than or equal (<=)
If # of bedrooms is smaller than or equal to the family size one year after a new child, may be predictive of buying bunk beds
Greater than (>)
If family is larger than # of seats in car, camping vacations become less likely
Greater than or equal (>=)
If # of seats is greater than or equal to family size, the likelihood of purchasing a new van may be lower
Not equal (!=)
When two data points are not the same, it can affect the prediction
If vibration of machine is different during operation than day before
Equal (==)
If two data points are the same, they may cancel each other out or indicate a higher likelihood of a target phenomenon occurring
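The comparison operators above produce boolean columns that can be cast to 1/0 features; a sketch using the hypothetical car-seat columns from the examples:

```python
import pandas as pd

# Hypothetical columns from the car-seat examples
df = pd.DataFrame({"family_size": [5, 3], "car_seats": [4, 5]})
df["seats_too_few"] = (df["car_seats"] < df["family_size"]).astype(int)
df["seats_sufficient"] = (df["car_seats"] >= df["family_size"]).astype(int)
```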
Exponentiation (**)
Ex: current interest of a continuously compounded loan
P = C * e**(r*t) can be used to capture the exponential relationship of e and (r*t)
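The continuous-compounding formula P = C * e**(r*t) in code; the principal, rate, and time values are hypothetical:

```python
import math

# Hypothetical values: principal C, annual rate r, time t in years
C, r, t = 1000.0, 0.05, 2.0
P = C * math.exp(r * t)  # P = C * e**(r*t)
```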
Transformations applied to one column at a time
Natural logarithm (Log())
Used to linearize exponential data
Ex: The higher a family's total income, the less likely they are to visit national parks, since they can afford other experiences; however, love of national parks would trump doubling income at some point (ex: going from $500,000 to $1,000,000 is not likely to negatively impact desire to visit NPs)
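Log linearization in a sketch: after taking the log, each doubling of (hypothetical) income adds the same constant amount rather than an ever-larger one:

```python
import numpy as np

# Hypothetical incomes; log turns each doubling into the same additive step
incomes = np.array([50_000.0, 100_000.0, 200_000.0])
log_income = np.log(incomes)
```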
Square root (Sqrt())
Similar to log transformation, but works for a different distribution of data
Square (Square())
Makes large values even larger
Ex: Square St.dev to find variance
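The st. dev / variance example as a sketch with a hypothetical sample (NumPy's `std`/`var` default to the population versions):

```python
import numpy as np

# Hypothetical sample; squaring the standard deviation gives the variance
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
variance = data.std() ** 2
```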