CH. 10 Data Transformations
10.1.1 IF-THEN statements and One-hot encoding
IF-THEN statements allow for the examination of a value in a column and the ability to make changes to that value or to other values in the dataset.
IF-THEN allows you to create content in a new column depending on what exists in one or more other columns
EX. EmployeeID column containing unique id for each person
IF EmployeeID = 1 THEN Emp_1 = 1 ELSE Emp_1 = 0
First step in splitting a column: one-hot encoding
conversion of a categorical column containing more than 2 possible values into discrete columns representing each value
dummy encoding: for each unique category in the column, a new column is created with TRUE/FALSE (1/0) binary values
by one-hot encoding employee ID values, predictive ability can improve
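The IF-THEN rule and the one-hot encoding step above can be sketched in plain Python. This is a minimal illustration, not the chapter's own tooling: the sample rows and the helper name `one_hot` are assumptions, and only the `EmployeeID` and `Emp_1` names come from the notes.

```python
# Hypothetical sample data; only the EmployeeID column name comes from the notes.
rows = [{"EmployeeID": 1}, {"EmployeeID": 2}, {"EmployeeID": 3}, {"EmployeeID": 1}]

def one_hot(rows, column, prefix):
    """Add one 0/1 column per unique value found in `column`."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        for value in categories:
            # IF row[column] = value THEN 1 ELSE 0
            new_row[f"{prefix}_{value}"] = 1 if row[column] == value else 0
        encoded.append(new_row)
    return encoded

encoded_rows = one_hot(rows, "EmployeeID", "Emp")
for row in encoded_rows:
    print(row)
```

Each row ends up with exactly one of the new `Emp_` columns set to 1, which is what "one-hot" refers to.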
10.1.2 Regular Expressions (RegEx)
Understanding RegEx is likely the best investment in data transformation skills
RegEx commands must be general
for specific strings to work, you need escape characters
an escape character indicates that the next character in the regex pattern should be interpreted as a special command EX. regex uses a backslash as the escape
"\d" matches any digit character
ex. "a\dc": "a1c" and "a5c" found but not "abc"
"\D" returns any characters besides digits including anything lowercase and uppercase and characters such as dashes
ex. "a\Dc": "aGc" and "a-c" found, but not "a1c"
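The "\d" and "\D" examples above can be checked with Python's standard `re` module. This is a small sketch; the helper name `found` is an assumption introduced for illustration, and `re.search` is used because the notes describe patterns being "found" anywhere in a string.

```python
import re

def found(pattern, text):
    """Return True when `pattern` occurs anywhere in `text`."""
    return re.search(pattern, text) is not None

# \d: the middle character must be a digit.
digit_results = {s: found(r"a\dc", s) for s in ["a1c", "a5c", "abc"]}

# \D: the middle character must be anything except a digit.
non_digit_results = {s: found(r"a\Dc", s) for s in ["aGc", "a-c", "a1c"]}

print(digit_results)
print(non_digit_results)
```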
"\W" returns any non alphanumeric character
ex. "a\Wc": "aGc" not found, "a-c" found, "a" and "c" separated by a space or a line break found
"." returns any character
ex. "a.c": "aGc" or "a-c" or "a1c" found
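A short sketch of "\W" and "." using Python's `re` module. Two details here are specific to Python's regex flavor and are not stated in the notes: "\W" treats the underscore as a word character, and "." does not match a line break unless the `re.DOTALL` flag is set.

```python
import re

# \W: the middle character must not be a letter, digit, or underscore.
non_word = {s: bool(re.search(r"a\Wc", s)) for s in ["aGc", "a-c", "a\nc", "a_c"]}

# . : the middle character can be anything except a line break (Python default).
any_char = {s: bool(re.search(r"a.c", s)) for s in ["aGc", "a-c", "a1c", "a\nc"]}

# With re.DOTALL the dot also matches a line break.
dot_all = bool(re.search(r"a.c", "a\nc", flags=re.DOTALL))

print(non_word)
print(any_char)
print(dot_all)
```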
"\s" matches any whitespace
ex. "a\sc": "a c" found, "aa c" found, "abc" not found
question mark makes the character preceding it optional
ex. "ab?c": "abc" found, "ac" found, "adc" not found
"a\Sc" any 3 character pattern starting with "a" and ending with "c" as long as the middle character is not a whitespace
ex. "a\Sc": "a c" not found, "aa c" not found, "abc" found
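The whitespace and optional-character rules can be tried out as follows. This is a minimal sketch with made-up test strings; it uses the patterns `a\sc`, `ab?c`, and `a\Sc`, where "?" applies to the character just before it.

```python
import re

def check(pattern, samples):
    """Map each sample string to True/False depending on whether the pattern is found."""
    return {s: re.search(pattern, s) is not None for s in samples}

whitespace = check(r"a\sc", ["a c", "aa c", "abc"])      # \s: any whitespace
optional_b = check(r"ab?c", ["abc", "ac", "adc"])        # ?: the "b" may be absent
non_whitespace = check(r"a\Sc", ["a c", "aa c", "abc"])  # \S: anything but whitespace

print(whitespace)
print(optional_b)
print(non_whitespace)
```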
carets (^) inside brackets specify anything but the listed characters, ex. [^a-zA-Z0-9] matches anything but alphanumeric values; pipes "|" allow a match on the pattern on either side, like ab|cd
[ brackets ] allow specification of a list of allowable characters for a specific character position ex. "3" or "7" allowed: [37]\d\d or [73]\d\d
dashes allow us to specify a range of allowable values ex. [KL][a-z][a-z] matches "K" or "L" followed by two lowercase letters from a to z
dollar signs specify the end of the string being examined; the pattern must be at the end of the string ex. "ab$": "aab" found, "abc" not found
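Brackets, ranges, negated brackets, pipes, and the end-of-string anchor can be compared side by side. The sketch below uses Python's `re` module; every sample string is made up for illustration, and `[^a-zA-Z0-9]` is one assumed way of writing "anything but alphanumeric values".

```python
import re

# Pattern -> sample strings; all samples are hypothetical.
cases = {
    r"[37]\d\d": ["312", "745", "512"],              # first character must be 3 or 7
    r"[KL][a-z][a-z]": ["Kim", "Lee", "KIM", "Tim"],  # K or L, then two lowercase letters
    r"[^a-zA-Z0-9]": ["a-c", "abc", "a1c"],          # any non-alphanumeric character
    r"ab|cd": ["xab", "cdx", "acbd"],                # either "ab" or "cd"
    r"ab$": ["aab", "abc"],                          # "ab" at the very end
}

results = {
    pattern: {text: re.search(pattern, text) is not None for text in texts}
    for pattern, texts in cases.items()
}

for pattern, outcome in results.items():
    print(pattern, outcome)
```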
CAPTURES: parentheses capture the substring matched by the pattern inside them, so that other columns can be made from the data pulled ex. (Mrs?) captures "Mr" or "Mrs" because the "?" makes the "s" optional; as written, (Mr?s) makes the "r" optional and captures "Ms" or "Mrs"
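A capture group can feed a new column, as in this sketch using Python's `re` module. The names in the list and the helper `extract` are hypothetical, and the trailing `\.` in each pattern is an added assumption that the title is followed by a period. The two patterns differ only in where the "?" sits.

```python
import re

# Hypothetical name values, as they might appear in a column of full names.
names = ["Mr. Jones", "Mrs. Smith", "Ms. Lee"]

def extract(pattern, text):
    """Return the text captured by the first group, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# (Mrs?) : "s" is optional, so "Mr" and "Mrs" are captured.
titles_mr_mrs = [extract(r"(Mrs?)\.", name) for name in names]

# (Mr?s) : "r" is optional, so "Ms" and "Mrs" are captured.
titles_ms_mrs = [extract(r"(Mr?s)\.", name) for name in names]

print(titles_mr_mrs)
print(titles_ms_mrs)
```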
REPETITIONS: specify that a pattern repeats a certain number of times within a string ex. (*) matches zero or more repetitions and (+) matches one or more
curly brackets include lower and upper bounds ex. a{2,4}: "a" not found, "aa" found, "aaa" found
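The repetition operators can be sketched with `re.fullmatch`, which requires the whole string to fit the pattern. That choice is an assumption made here so the upper bound is visible: with a plain search, "aa" would still be found inside a longer run such as "aaaaa". The sample strings are made up.

```python
import re

def whole_match(pattern, samples):
    """Map each sample to True when the entire string fits the pattern."""
    return {s: re.fullmatch(pattern, s) is not None for s in samples}

star = whole_match(r"ab*c", ["ac", "abc", "abbbc"])           # *: zero or more "b"
plus = whole_match(r"ab+c", ["ac", "abc", "abbbc"])           # +: one or more "b"
bounded = whole_match(r"a{2,4}", ["a", "aa", "aaa", "aaaaa"])  # {2,4}: two to four "a"

print(star)
print(plus)
print(bounded)
```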
Regular expressions are a powerful and flexible language for finding, replacing or extracting content from text
Ex. similar to a pattern search in Word, where every piece of text that fits the pattern, such as the word "one", is marked
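Finding, replacing, and extracting can each be sketched in a line or two with Python's `re` module. The sample sentences are made up, and the square brackets in the replacement simply stand in for the highlighting a word processor would apply.

```python
import re

text = "one by one, the team counted to one hundred"

# Finding: list every occurrence of the pattern.
matches = re.findall(r"one", text)

# Replacing: mark each occurrence, similar to highlighting search hits.
marked = re.sub(r"one", "[one]", text)

# Extracting: pull all runs of digits out of a hypothetical order note.
numbers = re.findall(r"\d+", "Order 66 shipped 3 items")

print(matches)
print(marked)
print(numbers)
```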