Data Transformations
different types of transformations to create new features
addition +
add columns to strengthen a predictive signal (ex: # spouses + # kids = family size)
Subtraction -
subtract columns so the similarity/difference between them becomes more apparent (ex: current indoor temp - current outdoor temp = likely to enjoy outdoors)
multiplication *
multiply columns that interact with a target in a way only detectable through the product (moderated relationship)
division /
divide columns to reveal info that may have been hidden (ex: family income / # kids = available funds for family)
less than <
less than or equal to <=
greater than >
greater than or equal to >=
equal ==
if points are the same, they might cancel out or indicate higher likelihood of target phenomenon
not equal !=
when 2 data points are not the same
absolute abs()
similar to subtraction, but use when distance between numbers is important
exponentiation **
creates an exponent (ex: P = C e^(rt) would be C * e**(r * t))
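The operators above can be sketched on a small table. This is a minimal illustration assuming pandas and NumPy, which the notes do not name; the column names and values are hypothetical and chosen to match the examples in the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "num_spouses": [1, 0, 1],
    "num_kids": [2, 0, 3],
    "family_income": [90000.0, 40000.0, 120000.0],
    "indoor_temp": [21.0, 22.0, 20.0],
    "outdoor_temp": [25.0, 5.0, 20.0],
})

# Addition: combine related counts into one signal.
df["family_size"] = df["num_spouses"] + df["num_kids"]

# Subtraction: make the difference between two columns explicit.
df["temp_diff"] = df["indoor_temp"] - df["outdoor_temp"]

# Multiplication: an interaction term between two columns.
df["income_x_kids"] = df["family_income"] * df["num_kids"]

# Division: a ratio; zero denominators become NaN instead of infinity.
df["income_per_kid"] = df["family_income"] / df["num_kids"].replace(0, np.nan)

# Comparisons: each yields True/False, stored here as 1/0.
df["warmer_outside"] = (df["outdoor_temp"] > df["indoor_temp"]).astype(int)
df["same_temp"] = (df["indoor_temp"] == df["outdoor_temp"]).astype(int)

# Absolute value: distance between two numbers, ignoring direction.
df["temp_gap"] = (df["indoor_temp"] - df["outdoor_temp"]).abs()

# Exponentiation: P = C * e**(r * t), with hypothetical C, r, t.
C, r, t = 1000.0, 0.05, 2
P = C * np.e ** (r * t)

print(df)
print(P)
```

Replacing zero with NaN before dividing is one possible design choice; a different dataset might call for dropping those rows or using a default value instead.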
creates new features (columns) depending on type of data available
text - make new columns from columns containing text
categorical - combine categories together so there are fewer categories. multi-categorical columns can be converted into binary columns
numerical - add, subtract, multiply 2+ columns to create new columns
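A short sketch of the text and categorical cases, assuming pandas. The column names, the sample values, and the grouping of departments are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data; names and categories are illustrative only.
df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
    "department": ["sales", "retail sales", "engineering"],
})

# Text: derive new columns from a text column.
df["first_name"] = df["full_name"].str.split(" ").str[0]
df["name_length"] = df["full_name"].str.len()

# Categorical: merge similar categories so there are fewer of them.
groups = {
    "sales": "sales",
    "retail sales": "sales",
    "engineering": "engineering",
}
df["department_group"] = df["department"].map(groups)

# Categorical to binary: 1 if the row belongs to the group, else 0.
df["is_sales"] = (df["department_group"] == "sales").astype(int)

print(df)
```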
Additional transformations that can only be applied to 1 column at a time
Natural logarithm Log()
used to linearize exponential data
Square root Sqrt()
used in a similar way to logs
Square Square()
makes large values larger (ex: standard deviation squared to find variance)
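The single-column transformations can be sketched as follows, assuming pandas and NumPy. The exponential series y = 2 * e^(0.5 t) is made up for illustration; taking its natural log gives a straight line in t, which is what "linearize" means here.

```python
import numpy as np
import pandas as pd

# Hypothetical exponential data: y = 2 * e**(0.5 * t).
df = pd.DataFrame({"t": [0, 1, 2, 3]})
df["y"] = 2.0 * np.exp(0.5 * df["t"])

# Natural log: ln(y) = ln(2) + 0.5 * t, which is linear in t.
df["log_y"] = np.log(df["y"])

# Square root: also compresses large values, less strongly than the log.
df["sqrt_y"] = np.sqrt(df["y"])

# Square: stretches large values further apart.
df["y_squared"] = np.square(df["y"])

# The step between consecutive log values is constant (the slope 0.5).
print(df["log_y"].diff())
```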
IF-THEN statements - examine values and can make changes to these values in the dataset
IF EmployeeID = 1 THEN Emp_1 = 1 ELSE Emp_1 = 0 ENDIF
places 1 and 0 into a data table, creating a binary column for that category
one-hot encoding - each category gets its own binary column, where 1 means true and 0 means false; helps machines analyze multi-categorical information
can calculate details to better understand and predict future performance based on information gathered from data about employee/customer identities
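The IF-THEN rule and one-hot encoding can be sketched like this, assuming pandas and NumPy. The EmployeeID values are hypothetical; `np.where` plays the role of the IF/THEN/ELSE, and `pd.get_dummies` applies the same rule to every category at once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"EmployeeID": [1, 2, 3, 1]})

# IF EmployeeID = 1 THEN Emp_1 = 1 ELSE Emp_1 = 0
df["Emp_1"] = np.where(df["EmployeeID"] == 1, 1, 0)

# One-hot encoding: one binary column per distinct EmployeeID.
one_hot = pd.get_dummies(df["EmployeeID"], prefix="Emp", dtype=int)

print(df)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column for that row's category.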
Exercises
one-hot encoding changes multi-categorical data into binary values