Regular Expressions
The escape character (the backslash, \) indicates that the next character in a RegEx pattern should be interpreted as a special command rather than as a literal character.
a\dc This shows the capability of RegEx: the \d command matches any single digit from 0 to 9, so the pattern finds any three-character text that starts with “a” and ends with “c” with a digit between them.
a\wc The \w command finds any single alphanumeric character (in most RegEx flavors the underscore counts as well). If you need more, specify how many: for any three-character alphanumeric string, use \w\w\w.
a\Dc The \D command is the opposite of \d: it matches any single character that is not a digit.
a\Wc The \W command finds any single non-alphanumeric character. Some examples include: !%&’()*;=+./{}^~. \W also finds space and return characters.
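The four commands above can be compared side by side. The following is a minimal sketch using Python's standard `re` module; the sample strings and the helper name `matching` are made up for illustration.

```python
import re

# Hypothetical sample strings, each three characters long.
samples = ["a1c", "abc", "a c", "a!c"]

def matching(pattern):
    """Return the samples that the pattern matches from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

digit = matching(r"a\dc")      # a digit in the middle
word = matching(r"a\wc")       # a letter, digit, or underscore in the middle
non_digit = matching(r"a\Dc")  # anything except a digit in the middle
non_word = matching(r"a\Wc")   # a space, punctuation mark, and so on in the middle

print(digit)      # ['a1c']
print(word)       # ['a1c', 'abc']
print(non_digit)  # ['abc', 'a c', 'a!c']
print(non_word)   # ['a c', 'a!c']
```

Raw strings (the `r"..."` prefix) keep Python from interpreting the backslash itself, so the pattern reaches the RegEx engine unchanged.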
[ ] Square brackets specify a list of allowable characters for a specific character position.
Example: to return three-letter names such as Kai or Lai, use [KL][a-z][a-z]. This matches a “K” or an “L” followed by any two lowercase letters.
The OR command is the pipe (|).
ab|cd
The pipe means that either the pattern on the left of the pipe or the pattern on the right is acceptable. In this case, either “ab” or “cd” will qualify as a match.
a[Ga]c Hard (square) brackets specify that the character in that position can be any one of the characters listed inside the brackets. Here, “aGc” and “aac” both match.
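A short sketch of square brackets and the pipe, again using Python's `re` module. The name list and the test strings are invented for illustration; note that [KL][a-z][a-z] also accepts other names of the same shape, such as Kim.

```python
import re

# Hypothetical list of names.
names = ["Kai", "Lai", "Mai", "Kim"]

# First position must be K or L; the next two must be lowercase letters.
three_letter = [n for n in names if re.fullmatch(r"[KL][a-z][a-z]", n)]

# Either "ab" or "cd" qualifies as a match.
either = re.findall(r"ab|cd", "xxabyycdzz")

# Middle position must be "G" or "a".
one_of = [s for s in ["aGc", "aac", "abc"] if re.fullmatch(r"a[Ga]c", s)]

print(three_letter)  # ['Kai', 'Lai', 'Kim']
print(either)        # ['ab', 'cd']
print(one_of)        # ['aGc', 'aac']
```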
Repetition covers cases in which a pattern should repeat a certain number of times within the string being analyzed.
• The first method is to use the asterisk (*) character, which declares that the RegEx pattern preceding it can be matched zero or more times, with no upper limit. The asterisk allows a pattern to be skipped entirely if it does not exist in a string (zero matches), or to be matched many times; for instance, \w* will match a run of alphanumeric characters of any length.
• The second method is the plus sign (+), which requires the preceding pattern to appear at least once.
ab+ The pattern begins with “a” and must be followed by at least one “b.” A match will be found in ‘aab’, but it covers just the last two letters.
• Finally, curly brackets “{}” specify exactly how many times a pattern must appear, with the option of giving a lower and an upper bound instead of a single count.
a{3} The pattern is exactly three “a” characters in a row.
a{2,4} The pattern is between two and four “a” characters in a row.
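The three repetition methods can be tried out as follows. This is a minimal sketch with Python's `re` module; the input strings are made up for illustration.

```python
import re

# Asterisk: "a" followed by zero or more "b" characters.
star = re.findall(r"ab*", "a ab abbb")

# Plus: "a" followed by at least one "b"; in "aab" only the last two letters match.
plus = re.search(r"ab+", "aab")
plus_text = plus.group()
plus_span = plus.span()

# Curly brackets: an exact count, or a lower and an upper bound.
exact = re.fullmatch(r"a{3}", "aaa") is not None
bounded = [s for s in ["a", "aa", "aaaa", "aaaaa"] if re.fullmatch(r"a{2,4}", s)]

print(star)                  # ['a', 'ab', 'abbb']
print(plus_text, plus_span)  # ab (1, 3)
print(exact)                 # True
print(bounded)               # ['aa', 'aaaa']
```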
Regular expressions are a powerful, flexible language for finding, replacing, or extracting content from text.
The primary goal of using RegEx for data science is to extract instances of a specified pattern in a string, thus allowing for the creation of a new column containing either a TRUE or a FALSE value (is the pattern matched or not).
The search feature in Microsoft Word is a similar, simpler version of this kind of pattern matching.
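The TRUE/FALSE column idea described above can be sketched with plain Python dictionaries standing in for table rows. The row contents and the column name `has_code` are hypothetical; a data-frame library would follow the same logic.

```python
import re

# Hypothetical rows of a table with a single text column.
rows = [
    {"text": "order a1c shipped"},
    {"text": "no code here"},
    {"text": "codes a7c and a9c"},
]

pattern = re.compile(r"a\dc")

# New column: True if the pattern occurs anywhere in the text, else False.
for row in rows:
    row["has_code"] = pattern.search(row["text"]) is not None

flags = [row["has_code"] for row in rows]
print(flags)  # [True, False, True]
```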
A caret (^) that is not inside square brackets indicates the beginning of the string: the pattern being searched for must exist at the start of the string. To specify anything but certain values, place the caret inside square brackets instead.
a[^Ga]c
Same as hard brackets, but a caret (^) at the start inside the brackets specifies that the listed characters cannot appear in that position.
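The two meanings of the caret can be contrasted in a short sketch using Python's `re` module. The pattern ^ab and the test strings are chosen for illustration, and the bracket pattern is written with a lowercase “a” to match the earlier a[Ga]c example.

```python
import re

# Outside brackets: "ab" must sit at the very start of the string.
starts = [s for s in ["abc", "cab"] if re.search(r"^ab", s)]

# Inside brackets: the middle character may be anything except "G" or "a".
negated = [s for s in ["aGc", "aac", "abc", "a1c"] if re.fullmatch(r"a[^Ga]c", s)]

print(starts)   # ['abc']
print(negated)  # ['abc', 'a1c']
```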
RegEx can also specify the context in which you want to find a pattern (i.e., when searching you can require that the match be at the beginning or at the end of a line).
abc? The question mark (?) makes the character preceding it optional. In this example, a match is found whether the “c” is there or not. This may seem wasteful because we could have matched by just using the “ab” pattern, but we will soon learn about capturing patterns, in which case capturing the “c” (if it is there) will be important.
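A minimal sketch of the optional character, using Python's `re` module. It uses the pattern abc? so that the “c” is the optional character, as the description above states; the input strings are made up.

```python
import re

optional_c = re.compile(r"abc?")

# The "c" is included in the match when it is present...
with_c = optional_c.search("xxabcxx").group()

# ...and the match still succeeds when it is absent.
without_c = optional_c.search("xxabxx").group()

print(with_c)     # abc
print(without_c)  # ab
```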
ab$
Dollar sign ($) specifies the end of the string being examined. The pattern must be at the end of the string.
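The end-of-string anchor can be checked with a small sketch in Python's `re` module; the test strings are invented for illustration.

```python
import re

# Only strings that finish with "ab" match.
ends = [s for s in ["cab", "abc", "ab"] if re.search(r"ab$", s)]

print(ends)  # ['cab', 'ab']
```

In Python specifically, `$` also matches just before a single trailing newline, so a line read from a file with its newline still attached would match as well.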
Captures refer to the explicit notation (capturing) of any substring matching a RegEx pattern that can then be used to create a new column for use in machine learning.
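A sketch of a capture feeding a new column, using Python's `re` module. The source does not show capture syntax; parentheses are the usual RegEx notation for a capture group and are assumed here. The texts and the variable name `captured` are hypothetical.

```python
import re

# Hypothetical text column.
texts = ["item a1c", "item abc", "item a7c"]

# The parentheses capture the digit between "a" and "c".
pattern = re.compile(r"a(\d)c")

# New column: the captured digit, or None when the pattern is not found.
captured = []
for text in texts:
    match = pattern.search(text)
    captured.append(match.group(1) if match else None)

print(captured)  # ['1', None, '7']
```

The captured values are strings; converting them to numbers before use in a model would be a separate step.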