Chapter 10.1.2: Regular Expressions (RegEx)
DEFINITION
Flexible language for finding, replacing, or extracting content from text
Can be further used to specify the context in which a pattern exists
Examples: detection and extraction of phone numbers and parentheses
Can be used for any text that has a predictable format
Emails are quite predictable due to the @ symbol in the middle of text with no spaces in it, followed by some characters, a period, some characters, a period, and a few more characters.
Example: the MS Word "Search" function works as the most basic RegEx function
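A minimal Python sketch of the extraction idea described above, using the standard re module. The two patterns and the sample sentence are illustrative assumptions, not patterns given in these notes, and the email pattern is deliberately loose (it only requires one period after the @).

```python
import re

# Hypothetical patterns for illustration only.
# Email: non-space text, an "@", more text, a period, and more text.
EMAIL_PATTERN = re.compile(r"[^\s@]+@[^\s@]+\.[^\s@]+")
# Phone: "(ddd) ddd-dddd"; the parentheses are escaped so they match literally.
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

text = "Contact jane.doe@example.com or call (555) 932-9382 today"

emails = EMAIL_PATTERN.findall(text)
phones = PHONE_PATTERN.findall(text)

print(emails)  # ['jane.doe@example.com']
print(phones)  # ['(555) 932-9382']
```

Both patterns work only because the text has a predictable format; free-form text would need a different approach.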
Examples of Simple RegEx Commands
555
Would find - (555) 932-9382
Would not find - (720) 828-8382
Any set of numbers may be specified
abc
Would find - abc Plumbing
Would not find - ab1 Plumbing
Any combination of letters may be specified
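The literal searches above can be sketched in Python with re.search, which looks for the pattern anywhere in a string. The helper name find_literal is an assumption made for this illustration.

```python
import re

def find_literal(pattern, strings):
    """Return the strings that contain the pattern somewhere inside them."""
    return [s for s in strings if re.search(pattern, s)]

records = ["(555) 932-9382", "(720) 828-8382", "abc Plumbing", "ab1 Plumbing"]

hits_555 = find_literal("555", records)
hits_abc = find_literal("abc", records)

print(hits_555)  # ['(555) 932-9382']
print(hits_abc)  # ['abc Plumbing']
```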
RegEx becomes useful when commands are general rather than literal
To look for any 3-character string starting with "a" and ending with "c", rather than exactly "abc", an escape character is needed
ESCAPE CHARACTER
Backslash - " \ "
Example: \d does not mean that we are looking for d, it means that any digit will do
If you wanted to find all combinations of 3 digits, \d\d\d
"\d" will return any digit character
"\D" will return any character except for digits
"\w" will return any alphanumeric character
(digits 0-9, letters a-z and A-Z, and the underscore)
"\W" will return any non-alphanumeric character
"." will return any character
has the largest scope (by default, every character except a line break)
"?" means that the character directly preceding it
is not necessary for a match
Example: British English "colour" vs. American English "color"
Pattern that works for both:
"colou?r" would return both
"\s" will find whitespace
(a space, tab, new line, and/or carriage return)
"\S" will return anything but whitespace
The backslash indicates that the next character in the RegEx pattern should be interpreted as a special command
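A short sketch of how these special characters behave in Python's re module. The sample strings and variable names are assumptions chosen for illustration; other RegEx engines may differ in small details.

```python
import re

# \d\d\d: any three digits in a row
three_digits = re.findall(r"\d\d\d", "Call 555 or 7208")

# u? makes the "u" optional, so both spellings match
spellings = re.findall(r"colou?r", "colour or color")

# \s matches whitespace such as a space or a tab
whitespace = re.findall(r"\s", "a b\tc")

# \w matches letters, digits, and the underscore; \W matches everything else
word_chars = re.findall(r"\w", "a_1-")
non_word_chars = re.findall(r"\W", "a-b c")

# "." matches any character except a line break (Python's default behavior)
any_chars = re.findall(r".", "a\nb")

print(three_digits)    # ['555', '720']
print(spellings)       # ['colour', 'color']
print(whitespace)      # [' ', '\t']
print(word_chars)      # ['a', '_', '1']
print(non_word_chars)  # ['-', ' ']
print(any_chars)       # ['a', 'b']
```

Raw strings (the r"..." prefix) keep Python from interpreting the backslash before the RegEx engine sees it.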
EXAMPLES OF SPECIAL CHARACTER USE
"a\dc"
any three-character text starting with "a" and ending with "c", with a digit (0-9) between them
would find "a1c"
would not find "abc"
"a\Dc"
"\D" means a (anything but a digit) c
would find "aGc", "a-c"
would not find "a1c"
"a\wc"
"\w" finds any single alphanumeric character
means a (anything alphanumeric) c
would find "aGc", "a1c"
would not find "a-c"
"a\Wc"
"\W" finds any non-alphanumeric character
also finds spaces and carriage-return characters
would find "a-c", "a c"
would not find "aGc"
"a.c"
"." (period) returns any character
means a(anything)c
would find "aGc", "a-c", "a c"
"ab?c"
"?" makes the character preceding it optional
means a(optional b)c
would find "abc", "ac"
would not find "adc"
"a\sc"
"\s" matches any whitespace
means a(whitespace)c
would find "a c", "aa c"
would not find "abc"
"a\Sc"
"\S" finds any character not a whitespace
means a(anything but whitespace)c
would find "abc", "a1c"
would not find "a c", "aa c"
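The examples in this section can be checked mechanically. The sketch below is one way to do so in Python, assuming "would find" means that the pattern occurs somewhere in the string (re.search). The helper finds and the expectations table are names introduced here for illustration.

```python
import re

def finds(pattern, text):
    """True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None

# pattern -> {candidate string: whether it should be found}
expectations = {
    r"a\dc": {"a1c": True, "abc": False},
    r"a\Dc": {"aGc": True, "a-c": True, "a1c": False},
    r"a\wc": {"aGc": True, "a1c": True, "a-c": False},
    r"a\Wc": {"a-c": True, "a c": True, "aGc": False},
    r"a.c": {"aGc": True, "a-c": True, "a c": True},
    r"ab?c": {"abc": True, "ac": True, "adc": False},
    r"a\sc": {"a c": True, "aa c": True, "abc": False},
    r"a\Sc": {"abc": True, "a1c": True, "a c": False, "aa c": False},
}

# Collect every case where the engine disagrees with the table.
mismatches = [
    (pattern, text)
    for pattern, cases in expectations.items()
    for text, expected in cases.items()
    if finds(pattern, text) != expected
]

print(mismatches)  # [] means every expectation holds
```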
Additional Specificity in RegEx Commands
Square Brackets "[ ]"
Specifies contents of brackets as allowable characters for a specific character position
Also: specify any character except those listed within the square brackets by placing the caret "^" in the first position inside the square brackets
"[^123]" will return any character except for 1, 2, 3 in that position in the expression
Examples:
"a[Ga]c"
Would find "aGc", "aac"
Would not find "abc", "a@c"
"a[^Ga]c"
Would find "aTc", "abc"
Would not find "aGc", "aac"
"ab|cd"
Pipe command means that either the pattern on the left of the pipe or the pattern on the right is acceptable
Would find "abc", "bcd"
Would not find "adc"
"^ab"
Caret specifies the beginning of the string being examined
would find "abc"
would not find "bcab", "aab"
"ab$"
Dollar sign specifies the end of the string being examined
would find "aab", "bcab"
would not find "abc"
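A small Python sketch of square brackets, the pipe, and the two anchors, filtering one list of candidate strings with each pattern. The candidate list and the helper name matching are assumptions made for this illustration. Matching is case-sensitive by default.

```python
import re

candidates = ["aGc", "aac", "abc", "aTc", "bcd", "adc", "bcab", "aab"]

def matching(pattern):
    """Return the candidates in which the pattern is found."""
    return [s for s in candidates if re.search(pattern, s)]

in_set = matching(r"a[Ga]c")       # middle character must be "G" or "a"
not_in_set = matching(r"a[^Ga]c")  # middle character must not be "G" or "a"
either = matching(r"ab|cd")        # contains "ab" or contains "cd"
at_start = matching(r"^ab")        # string begins with "ab"
at_end = matching(r"ab$")          # string ends with "ab"

print(in_set)      # ['aGc', 'aac']
print(not_in_set)  # ['abc', 'aTc', 'adc']
print(either)      # ['abc', 'bcd', 'bcab', 'aab']
print(at_start)    # ['abc']
print(at_end)      # ['bcab', 'aab']
```

The caret has two roles: inside square brackets it negates the set, and outside them it anchors the pattern to the start of the string.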
Repetitions
Refers to cases in which a pattern should repeat a certain number of times within the string being analyzed.
Asterisk "*"
declares that the pattern preceding it can be matched zero to an infinite number of times
Allows for a pattern to be skipped entirely if it does not exist in a string (has zero matches), OR to be matched many times
Example: "\w*" will match a string of alphanumeric characters (letters, digits, underscore) of any length, including zero
Using "+" changes the lower bound from zero to one
Similar to the asterisk, just changes the minimum number of matches to at least one
Using curly braces "{ }" specifies exactly how many times a pattern must appear, with the possibility of giving lower and upper bounds instead
EXAMPLES
"ab*"
pattern begins with an "a" and may or may not contain any number of "b" 's
will find "a", "aab", "abbb"
"ab+"
pattern begins with "a" and must be followed by at least one "b"
"abbb" would be found
"aab" would be found, but only because of the last two letters
would not find "a"
a{3}
the pattern is 3 "a"
will find only "aaa"
would not find "a", "aa"
a{2,4}
the pattern is between 2 and 4 "a"
will find "aa", "aaa", "aaaa"
will not find "a"
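The repetition examples above can be tried out in Python as follows. This is a sketch: first_match is a helper name assumed for illustration, and re.fullmatch is used for the curly-brace cases on the assumption that "will find" there refers to the whole string fitting the pattern.

```python
import re

def first_match(pattern, text):
    """Return the first substring matched by the pattern, or None."""
    m = re.search(pattern, text)
    return m.group() if m else None

# "*" allows zero or more "b" after the "a"
star = [first_match(r"ab*", s) for s in ["a", "aab", "abbb"]]

# "+" requires at least one "b" after the "a"
plus = [first_match(r"ab+", s) for s in ["abbb", "aab", "a"]]

# {3} means exactly three; {2,4} means between two and four
exactly_three = [s for s in ["a", "aa", "aaa"] if re.fullmatch(r"a{3}", s)]
two_to_four = [s for s in ["a", "aa", "aaa", "aaaa"] if re.fullmatch(r"a{2,4}", s)]

print(star)           # ['a', 'a', 'abbb']
print(plus)           # ['abbb', 'ab', None]
print(exactly_three)  # ['aaa']
print(two_to_four)    # ['aa', 'aaa', 'aaaa']
```

Note that "ab*" finds only the first "a" in "aab", because zero "b" characters is already an acceptable match, while "ab+" skips ahead to the "ab" at the end.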
Captures
The explicit notation (capturing) of any substring matching a RegEx pattern that can be used to create a new column for use in machine learning
Repetition patterns may be surrounded with parentheses to capture any substring that fits the pattern specified
Example: Using the parentheses around a pattern to capture whatever substring is matched by the pattern
"(Mrs?)"
Will capture "Mr", "Mrs"
Will not capture "Miss"
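A sketch of turning a capture into a new column, using plain Python dictionaries as rows. The pattern "(Mrs?)" is used here because it is the form that produces the captures listed above ("Mr" and "Mrs", but not "Miss"). The sample names and the column name "title" are assumptions made for illustration.

```python
import re

# The parentheses capture "Mr" followed by an optional "s".
TITLE_PATTERN = re.compile(r"(Mrs?)")

rows = [
    {"name": "Mr Smith"},
    {"name": "Mrs Jones"},
    {"name": "Miss Patel"},
]

# Add a hypothetical "title" column holding the captured substring,
# or None when the pattern does not match.
for row in rows:
    match = TITLE_PATTERN.search(row["name"])
    row["title"] = match.group(1) if match else None

titles = [row["title"] for row in rows]
print(titles)  # ['Mr', 'Mrs', None]
```

The resulting column could then be encoded as a categorical feature for a machine learning model.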