Ch. 10: Data Transformations
10.1: Splitting and Extracting New Columns
10.1.1: IF-THEN Statements and One-Hot Encoding
IF-THEN statements examine data using an IF condition (for example, a certain keyword or column value) and a THEN action that changes a value when the condition is met
Most programming tools support IF-THEN statements
Examples of IF-THEN statements include grouping similar members based on a specific employee ID, or placing them in another category that separates them from others
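The grouping idea above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the chapter: the employee records, the `assign_group` helper, and the ID cutoff of 2000 are all hypothetical.

```python
# Hypothetical employee records; the IDs, names, and cutoff are illustrative only.
employees = [
    {"employee_id": 1001, "name": "Ana"},
    {"employee_id": 2002, "name": "Ben"},
    {"employee_id": 1003, "name": "Cy"},
]

def assign_group(employee_id):
    # IF the ID is below the assumed cutoff, THEN use group "A";
    # otherwise fall through to group "B".
    if employee_id < 2000:
        return "A"
    return "B"

# Create a new column ("group") from the existing employee_id column.
for row in employees:
    row["group"] = assign_group(row["employee_id"])

groups = [row["group"] for row in employees]
print(groups)  # ['A', 'B', 'A']
```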
One-Hot Encoding
Approach for splitting column data - converts a categorical column containing two or more possible values into discrete columns, one representing each value
By one-hot encoding, your predictive ability will most likely improve, and the data becomes organized based on repetition and past records
Useful when trying to link data together based on vague but consistent information, such as linking a certain employee ID to a name so that other information may be connected to that name
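As a rough sketch of the splitting step, the snippet below one-hot encodes a single categorical column in plain Python. The `one_hot` helper, the `is_` column prefix, and the department values are assumptions made for illustration; data tools typically provide their own built-in equivalent.

```python
def one_hot(values):
    # One new 0/1 column per distinct category, in sorted order.
    categories = sorted(set(values))
    return {
        f"is_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }

# Hypothetical categorical column with three possible values.
departments = ["sales", "hr", "sales", "it"]
encoded = one_hot(departments)

for name, column in encoded.items():
    print(name, column)
# is_hr [0, 1, 0, 0]
# is_it [0, 0, 0, 1]
# is_sales [1, 0, 1, 0]
```

Each row has exactly one 1 across the new columns, which is what distinguishes one-hot encoding from assigning arbitrary numeric codes to the categories.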
10.2: Transformations
There are a multitude of feature transformations that can improve performance when used in various models
The following basic transformations are mentioned in this chapter:
Addition (+): predictive signals may be increased by adding two columns together
Subtraction (-): subtracting two columns can determine whether they are similar or different; the closer the result is to zero, the more similar they are
Absolute (Abs): similar to subtraction - used in cases where the actual distance between two numbers matters rather than whether the difference is positive or negative
Multiplication (*): sometimes columns interact only through a product - this interaction effect is often called a moderated relationship between a column and a target
Division (/): sometimes hidden information can be revealed through division, such as income/kids as a measure of available funds for a family
Less than (<): can test whether a value falls below a required bound, such as the number of car seats compared with the size of a family
Less than or equal (<=): similar to above but also includes the boundary value itself
Greater than (>): can test whether a value exceeds a required bound
Greater than or equal (>=): same as above but also includes the boundary value itself
Not Equal (!=): when two data points are not the same, this can have an effect on prediction
Equal (==): if two data points are the same, they may cancel each other out or indicate a higher likelihood of a target phenomenon occurring
Exponentiation (**): may be of interest when, for example, working with the current interest on a continuously compounded loan or bond
Natural Logarithm (Log()): generally used to linearize exponential data
Square Root (Sqrt()): similar to the log transformation, but works for a different distribution of data
Square (Square()): makes large values even larger (values between -1 and 1 become smaller)
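Several of the transformations listed above can be sketched on a single record using only the Python standard library. The household columns (`income`, `kids`, `adults`, `car_seats`) and the loan figures are hypothetical values chosen for illustration, and the feature names are not from the chapter.

```python
import math

# Hypothetical household record; column names and values are illustrative.
row = {"income": 60000.0, "kids": 3, "adults": 2, "car_seats": 2}

features = {
    # Arithmetic transformations
    "household_size": row["adults"] + row["kids"],         # addition
    "seat_gap": row["car_seats"] - row["kids"],            # subtraction
    "abs_seat_gap": abs(row["car_seats"] - row["kids"]),   # absolute distance
    "adults_x_kids": row["adults"] * row["kids"],          # multiplication
    # Division needs a guard against a zero denominator.
    "income_per_kid": row["income"] / row["kids"] if row["kids"] != 0 else None,
    # Comparison transformations produce True/False columns
    "too_few_seats": row["car_seats"] < row["kids"],        # less than
    "enough_seats": row["car_seats"] >= row["kids"],        # greater than or equal
    "adults_equal_kids": row["adults"] == row["kids"],      # equal
    # Mathematical transformations
    "log_income": math.log(row["income"]),                  # natural logarithm
    "sqrt_income": math.sqrt(row["income"]),                # square root
    "income_squared": row["income"] ** 2,                   # square
}

# Exponentiation: value of an assumed 1000 principal after 2 years at an
# assumed 5% rate with continuous compounding, i.e. principal * e ** (rate * years).
principal, rate, years = 1000.0, 0.05, 2
features["compounded_value"] = principal * math.e ** (rate * years)

for name, value in features.items():
    print(name, value)
```

The natural logarithm and square root are only defined here for positive and non-negative inputs respectively, so real columns containing zeros or negatives would need handling before these transformations are applied.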