Operations Applied to Two Columns
Less Than <
If the number of seats available in a family’s largest car is smaller than their family size after the birth of a new child, this feature may be predictive of the purchase of a new car.
Less Than or Equal <=
If the number of bedrooms in a family’s home is smaller than or equal to the family size one year after the birth of a new child, this feature may be predictive of the purchase of new bunk beds.
Dividing one column by another can expose information that is otherwise hidden from some types of algorithms.
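A ratio feature can be sketched in pandas as follows. The column names and values are hypothetical, and replacing zero denominators with a missing value is one possible convention, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical household data; names and values are illustrative only.
df = pd.DataFrame({
    "family_size": [3, 5, 4, 2],
    "bedrooms": [3, 2, 4, 0],
})

# Turn zero denominators into NaN so the ratio is missing rather than infinite.
safe_bedrooms = df["bedrooms"].replace(0, np.nan)

# People per bedroom: a crowding signal that neither column carries alone.
df["people_per_bedroom"] = df["family_size"] / safe_bedrooms
```

A tree-based model can only split on one column at a time, so handing it the ratio directly may save it from approximating the relationship with many splits.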
Greater Than >
If the family unit is larger than the number of seats in the largest car owned by the family, fun summer camping vacations to national parks become less likely.
Sometimes two related columns interact with a target in a way that is detectable only through their product. This interaction effect is often called a moderated relationship between a column and the target: the strength of that relationship depends on the size of another feature.
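A product (interaction) feature might look like the sketch below. The columns ad_spend and store_count are hypothetical, chosen so that the effect of one plausibly depends on the size of the other.

```python
import pandas as pd

# Hypothetical data: advertising may only help where there are stores to buy from.
df = pd.DataFrame({
    "ad_spend": [1.0, 2.0, 3.0],
    "store_count": [10, 0, 5],
})

# The product is zero whenever either column is zero and grows with both,
# which lets a linear model pick up the moderated relationship.
df["ad_spend_x_store_count"] = df["ad_spend"] * df["store_count"]
```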
Greater Than or Equal >=
If the number of available seats in a car is greater than or equal to the family size, the likelihood of purchasing a new van may be lower.
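The four ordered comparisons can be sketched in pandas as follows, assuming hypothetical columns car_seats and family_size. Each comparison produces a boolean column that can be used as a feature.

```python
import pandas as pd

# Hypothetical household data; names and values are illustrative only.
df = pd.DataFrame({
    "family_size": [3, 5, 4, 6],
    "car_seats": [5, 5, 7, 5],
})

# Each comparison is evaluated row by row and yields True or False.
df["seats_lt_family"] = df["car_seats"] < df["family_size"]
df["seats_le_family"] = df["car_seats"] <= df["family_size"]
df["family_gt_seats"] = df["family_size"] > df["car_seats"]
df["seats_ge_family"] = df["car_seats"] >= df["family_size"]
```

Only the last family (six people, five seats) has too few seats, so it is the only row where the less-than and greater-than features are True.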
The absolute difference is similar to subtraction, but what matters is the distance between the two numbers rather than whether the difference is negative or positive.
Not Equal !=
When two data points are not the same, this can have an effect on a prediction.
By subtracting one column from another, the similarity or difference between them becomes more apparent. The closer the result is to 0, the more similar the two values are.
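Subtraction and the absolute difference can be compared side by side in a short sketch. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical household data; names and values are illustrative only.
df = pd.DataFrame({
    "car_seats": [5, 5, 7],
    "family_size": [3, 5, 9],
})

# Signed difference: positive means spare seats, negative means too few.
df["seat_surplus"] = df["car_seats"] - df["family_size"]

# Absolute difference: only the size of the gap, not its direction.
df["seat_gap"] = (df["car_seats"] - df["family_size"]).abs()
```

The first and third rows have the same gap of 2 but opposite signs, which is exactly the information the absolute difference discards.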
If two data points are the same, they may cancel each other out in some cases, or they may indicate a higher likelihood of a target phenomenon occurring.
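Equality and inequality features can be sketched the same way. The billing and shipping ZIP code columns below are hypothetical.

```python
import pandas as pd

# Hypothetical order data; names and values are illustrative only.
df = pd.DataFrame({
    "billing_zip": ["80301", "80302", "80303"],
    "shipping_zip": ["80301", "80305", "80303"],
})

# True where the two columns hold the same value in a row.
df["zip_match"] = df["billing_zip"] == df["shipping_zip"]

# True where they differ; this is the logical negation of zip_match.
df["zip_mismatch"] = df["billing_zip"] != df["shipping_zip"]
```

Because one feature is the exact negation of the other, a model usually needs only one of them.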
By adding columns together, predictive signals can be increased.
When dealing with financial data, the current value of a continuously compounded loan or bond may be of interest. To represent P = C e^(rt) in data transformations, C * exp(r * t) can be used to capture the exponential relationship between e and (r*t).
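The continuous-compounding formula P = C e^(rt) might be computed as a column like this. The column names principal, rate, and years are hypothetical stand-ins for C, r, and t.

```python
import numpy as np
import pandas as pd

# Hypothetical loan data: principal is C, rate is r (per year), years is t.
df = pd.DataFrame({
    "principal": [1000.0, 2500.0],
    "rate": [0.05, 0.03],
    "years": [2.0, 10.0],
})

# P = C * e^(r * t), evaluated row by row with NumPy's vectorized exp.
df["value"] = df["principal"] * np.exp(df["rate"] * df["years"])
```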
Operators Applied to One Column
Square Root Sqrt()
Works for a different distribution of data than the logarithm. Compare its ability to linearize data against that of the log.
Makes large values even larger.
Natural Logarithm Log()
Generally used to linearize exponential data.
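The difference between the two transforms can be seen on synthetic data. In this sketch the two curves are made up for illustration: one grows exponentially in t and the other quadratically.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only.
df = pd.DataFrame({"t": [0, 1, 2, 3, 4]})
df["y_exp"] = 2.0 * np.exp(0.5 * df["t"])  # exponential growth
df["y_sq"] = (3.0 * df["t"]) ** 2          # quadratic growth

# The natural log turns exponential growth into a straight line in t:
# log(y_exp) = log(2) + 0.5 * t
df["log_y_exp"] = np.log(df["y_exp"])

# The square root turns quadratic growth into a straight line in t:
# sqrt(y_sq) = 3 * t
df["sqrt_y_sq"] = np.sqrt(df["y_sq"])
```

The natural log is undefined for zero and negative values, so columns that contain zeros are often shifted first, for example with np.log1p, which computes log(1 + x).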