Advanced RegEx
RegEx, more commonly referred to as Regular Expressions, is a string describing a data pattern. We can use RegEx for a variety of purposes, such as finding, extracting, splitting, and replacing text that matches a pattern.
Python provides the re module to work with Regular Expressions. To use it, import it with the statement import re.
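As a quick check that the module is available, here is a minimal sketch. The sample text and pattern are illustrative assumptions, not taken from the original article.

```python
import re

# Illustrative text and pattern (assumed for this sketch).
text = "Hello World"
match = re.search(r"World", text)

# search() returns a Match object on success and None otherwise.
print(match is not None)
```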
Sequence | Description |
---|---|
\A | Matches only if the specified pattern is at the start of the string. |
\b | Matches if the specified characters are at the beginning or the end of a word. |
\B | Matches if the specified characters are present, but not at a word's beginning or end. |
\d | Matches any digit (0-9). |
\D | Matches any character that is not a digit. |
\s | Matches any whitespace character. |
\S | Matches any non-whitespace character. |
\w | Matches any word character: a-z, A-Z, 0-9, and the underscore (_). |
\W | Matches any character that is not a word character (not a-z, A-Z, 0-9, or the underscore). |
\Z | Matches only if the specified pattern is at the end of the string. |
\A: Matches only if the specified pattern is at the start of the string.
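The snippet for this case is not shown here. A minimal sketch that yields the output below, assuming a sample string that starts with "Jingle", could be:

```python
import re

text = "Jingle Bell Jingle Bell Jingle Bell"  # assumed sample string
# \A anchors the pattern to the very start of the string.
result = re.findall(r"\AJingle", text)
print(result)
```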
['Jingle']
\b: Matches if the specified characters are at the beginning or the end of a word.
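With the same assumed sample string, a word-boundary search along these lines would find every word that starts with "Jingle":

```python
import re

text = "Jingle Bell Jingle Bell Jingle Bell"  # assumed sample string
# \b before the word requires a word boundary at that position.
result = re.findall(r"\bJingle", text)
print(result)
```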
['Jingle', 'Jingle', 'Jingle']
\B: Matches if the specified characters are present, but not at a word's beginning or end.
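One possible reconstruction of this case is sketched below. Both the input string and the exact pattern are assumptions, chosen so that only the word containing "possible" in its interior is returned.

```python
import re

text = "possible impossible"  # assumed sample string
# \w* picks up any leading word characters; \B then requires that
# "possible" does not begin at a word boundary.
result = re.findall(r"\w*\Bpossible", text)
print(result)
```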
['impossible']
\d: Matches any digit (0-9).
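A short sketch of digit matching, assuming a sample sentence that contains a single digit:

```python
import re

text = "I have 2 credit cards."  # assumed sample sentence
# \d matches one digit at a time.
result = re.findall(r"\d", text)
print(result)
```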
['2']
\D: Matches any character that is not a digit.
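To get a string without digits, the individual non-digit matches can be joined back together. This sketch assumes the same sample sentence as before. Note that the two spaces that surrounded the digit both remain in the joined result.

```python
import re

text = "I have 2 credit cards."  # assumed sample sentence
# Collect every non-digit character and join them into one string.
result = "".join(re.findall(r"\D", text))
print(result)
```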
I have credit cards.
\s: Matches any whitespace character.
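A sketch for whitespace matching, again with an assumed sentence containing four spaces:

```python
import re

text = "I have 2 credit cards."  # assumed sample sentence
# Each space is returned as its own match.
result = re.findall(r"\s", text)
print(result)
```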
[' ', ' ', ' ', ' ']
\S: Matches any non-whitespace character.
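Joining the non-whitespace matches gives the sentence with its spaces removed. The input is an assumed sample sentence.

```python
import re

text = "I have 2 credit cards."  # assumed sample sentence
# Keep everything except whitespace, then join into one string.
result = "".join(re.findall(r"\S", text))
print(result)
```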
Ihave2creditcards.
\w: Matches any word character: a-z, A-Z, 0-9, and the underscore (_). With str patterns in Python 3, other Unicode letters and digits also match unless the re.ASCII flag is used.
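The following sketch assumes an input made of word characters followed by punctuation, so that only the word characters survive:

```python
import re

text = "I_have_2_credit_cards!!!"  # assumed sample string
# \w keeps letters, digits, and underscores; "!" is dropped.
result = "".join(re.findall(r"\w", text))
print(result)
```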
I_have_2_credit_cards
\W: Matches any character that is not a word character (not a-z, A-Z, 0-9, or the underscore).
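The complementary case, under the same assumed input, keeps only the non-word characters:

```python
import re

text = "I_have_2_credit_cards!!!"  # assumed sample string
# \W matches everything that \w does not, here the exclamation marks.
result = "".join(re.findall(r"\W", text))
print(result)
```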
!!!
\Z: Matches only if the specified pattern is at the end of the string.
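Assuming a sample string that ends with "Bell", an end-anchored search might look like this:

```python
import re

text = "Jingle Bell Jingle Bell Jingle Bell"  # assumed sample string
# \Z anchors the pattern to the very end of the string.
result = re.findall(r"Bell\Z", text)
print(result)
```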
['Bell']
Set | Description |
---|---|
[chars] | Matches any one of the characters listed inside the []. |
[char_range_begin-char_range_end] | Matches any one character in the range given inside the []. |
[^chars] | Matches any character except the ones listed. |
[digits] | Matches any one of the digits listed. |
[digit_begin-digit_end] | Matches any one digit in the given range. |
[digit_begin-digit_end][digit_begin-digit_end] | Matches consecutive digits, each drawn from its own range, in that order. |
[char_begin-char_endchar_begin-char_end] | Matches any one character that falls in either of the given ranges. |
[chars]: Matches any one of the characters listed inside the [].
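A sketch of a character set that finds nothing, assuming an all-uppercase input string:

```python
import re

text = "ABCDEFGHIJKL"  # assumed sample string, uppercase only
# Sets are case-sensitive, so lowercase a, b, c do not match here.
result = re.findall(r"[abc]", text)
print(result)
```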
[]
The above example returns an empty list because the lowercase a, b, c characters are not present.
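For contrast, a set of uppercase characters that do occur in the same assumed input:

```python
import re

text = "ABCDEFGHIJKL"  # assumed sample string
# Each listed character is matched wherever it occurs.
result = re.findall(r"[JKL]", text)
print(result)
```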
['J', 'K', 'L']
The above example, however, returns the J, K, L characters.
[char_range_begin-char_range_end]: Matches any one character in the range given inside the [].
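A range works the same way as a list of characters. In this sketch the lowercase range a-d finds nothing in an assumed uppercase input:

```python
import re

text = "ABCDEFGHIJKL"  # assumed sample string, uppercase only
# [a-d] is shorthand for [abcd]; none of these occur in the text.
result = re.findall(r"[a-d]", text)
print(result)
```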
[]
[^chars]: Matches any character except the ones listed.
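A negated set, sketched with the same assumed input, returns every character other than A, B, and C:

```python
import re

text = "ABCDEFGHIJKL"  # assumed sample string
# A leading ^ inside the brackets inverts the set.
result = re.findall(r"[^ABC]", text)
print(result)
```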
['D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
[digits]: Matches any one of the digits listed.
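Digits can be listed in a set just like letters. This sketch assumes the input "1234567890" and picks out the even digits:

```python
import re

text = "1234567890"  # assumed sample string
# Matches are returned in the order they appear in the text.
result = re.findall(r"[02468]", text)
print(result)
```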
['2', '4', '6', '8', '0']
[digit_begin-digit_end]: Matches any one digit in the given range.
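The same assumed input with a digit range instead of a list:

```python
import re

text = "1234567890"  # assumed sample string
# [0-4] matches the digits 0, 1, 2, 3, and 4.
result = re.findall(r"[0-4]", text)
print(result)
```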
['1', '2', '3', '4', '0']
[digit_begin-digit_end][digit_begin-digit_end]: Matches consecutive digits, each drawn from its own range, in that order.
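Two sets placed side by side match two adjacent characters. The input below is an assumption, chosen so that only the numbers 20 to 24 satisfy both ranges:

```python
import re

text = "20 21 22 23 24 25 26 27 28 29"  # assumed sample string
# First digit must be 0-2, second digit must be 0-4.
result = re.findall(r"[0-2][0-4]", text)
print(result)
```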
['20', '21', '22', '23', '24']
You can chain any number of such sets one after another.
[char_begin-char_endchar_begin-char_end]: Matches any one character that falls in either of the given ranges.
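Several ranges can share one pair of brackets. The input string here is invented for illustration. It mixes letters with digits and spaces so that only the letters are returned:

```python
import re

text = "Ahdb 123 dhkd 456 kjd"  # assumed sample string
# [a-zA-Z] accepts any lowercase or uppercase ASCII letter.
result = re.findall(r"[a-zA-Z]", text)
print(result)
```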
['A', 'h', 'd', 'b', 'd', 'h', 'k', 'd', 'k', 'j', 'd']
Meta Character | Description |
---|---|
[] | Square brackets describe a set of characters. |
\ | Introduces a special sequence. It also escapes special characters, including the '\' character itself. |
. | The dot (.) matches any character except the newline character. |
^ | The caret (^) specifies that the string should start with the given pattern. |
$ | The dollar ($) specifies that the string should end with the given pattern. |
* | The asterisk (*) matches zero or more occurrences. |
+ | The plus (+) matches one or more occurrences. |
{} | Specifies the number of occurrences: {n} for exactly n, {n,} for at least n, and {m,n} for a range. |
\| | The pipe (\|) matches either of the given patterns. |
() | Used for grouping. |
[]: Square brackets describe a set of characters.
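A sketch of a lowercase range search. The sentence is an assumption, picked so that it contains exactly three characters from a-d:

```python
import re

text = "world, a nice one"  # assumed sample string
# Only characters between a and d (lowercase) are returned.
result = re.findall(r"[a-d]", text)
print(result)
```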
['d', 'a', 'c']
In this example, we look for all the characters within the specified set: a-d (lowercase).
\: Introduces a special sequence. It also escapes special characters, including the '\' character itself. This is covered in more detail later, as it can be pretty confusing for beginners.
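The backslash turns the letter d into the digit sequence \d. A minimal sketch, assuming a sentence with one digit:

```python
import re

text = "I have 1 card"  # assumed sample sentence
# The raw string keeps the backslash intact for the regex engine.
result = re.findall(r"\d", text)
print(result)
```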
['1']
The above example matches individual digits, not whole numbers, as the example below demonstrates.
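With a two-digit number in an assumed sentence, each digit comes back as a separate match:

```python
import re

text = "I have 21 cards"  # assumed sample sentence
# \d matches one digit, so "21" yields two results.
result = re.findall(r"\d", text)
print(result)
```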
['2', '1']
.: The dot (.) matches any character except the newline character.
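The sketch below assumes a Windows-style path as input and uses three dots as placeholders for three arbitrary characters:

```python
import re

path = r"C:\Users\anonymous\Desktop\my_folder"  # assumed sample path
# "anon", then any three characters, then "us".
match = re.search(r"anon...us", path)
result = match.group()
print(result)
```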
anonymous
This will search for a word starting with 'anon', followed by three non-newline characters, and ending with the characters 'us'.
^: The caret (^) specifies that the string should start with the given pattern.
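A start-anchored search on the same assumed path could be written as follows. The backslash in the path has to be escaped in the pattern.

```python
import re

path = r"C:\Users\anonymous\Desktop\my_folder"  # assumed sample path
# ^ anchors to the start; \\ in a raw pattern matches one literal backslash.
match = re.search(r"^C:\\Users", path)
result = match.group()
print(result)
```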
C:\Users
$: The dollar ($) specifies that the string should end with the given pattern.
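The end-anchored counterpart, again on an assumed path:

```python
import re

path = r"C:\Users\anonymous\Desktop\my_folder"  # assumed sample path
# $ anchors the pattern to the end of the string.
match = re.search(r"my_folder$", path)
result = match.group()
print(result)
```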
my_folder
*: The asterisk (*) matches zero or more occurrences.
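In this sketch the asterisk applies to the final 'o' only. The sentence is an assumption containing one word with "ello" and one with just "ell":

```python
import re

text = "Hello, I can tell"  # assumed sample sentence
# "ell" followed by zero or more "o" characters.
result = re.findall(r"ello*", text)
print(result)
```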
['ello', 'ell']
Here, the match must contain the substring "ell" followed by zero or more 'o' characters.
+: The plus (+) matches one or more occurrences.
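With +, at least one 'o' is required. The input below is an assumption in which "ell" is never followed by 'o', so nothing matches:

```python
import re

text = "Shell, bell, tell"  # assumed sample string
# "ell" followed by one or more "o" characters; none occur here.
result = re.findall(r"ello+", text)
print(result)
```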
[]
{}: Specifies the number of occurrences: {n} for exactly n, {n,} for at least n, and {m,n} for a range.
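A possible version of this example, assuming a sentence that contains "Hello" twice and using the "at least one" form of the braces:

```python
import re

text = "Hello World, Hello Python"  # assumed sample sentence
# {1,} means one or more of the preceding character, here "o".
result = re.findall(r"ello{1,}", text)
print(result)
```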
['ello', 'ello']
This example requires at least 1 'o' character after "ell". In the above string, the word Hello ends with ello.
|: The pipe (|) matches either of the given patterns.
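The same alternation pattern can be tried against two inputs. Both sentences are assumptions, each containing one of the two alternatives:

```python
import re

pattern = r"favourite|preferred"
# Assumed sample sentences, one per alternative.
first = re.findall(pattern, "Python is my favourite language")
second = re.findall(pattern, "Python is my preferred language")
print(first)
print(second)
```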
['favourite']
['preferred']
(): Parentheses are used for grouping.
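Grouping lets parts of a match be read back separately. The text and the pattern below are assumptions, built only to produce a whole match like the one shown:

```python
import re

text = "I drive a Honda-CRV8"  # assumed sample string
# Two groups: the make, and the model followed by one digit.
match = re.search(r"(Honda)-(CRV\d)", text)
print(match.group())   # whole match
print(match.group(1))  # first group only
print(match.group(2))  # second group only
```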
Honda-CRV8
The findall method in the re module finds all the matching content and returns it as a list. If nothing matches, the list is empty.
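A sketch showing both outcomes, with an assumed sample string and two assumed patterns:

```python
import re

text = "Hello World"  # assumed sample string
found = re.findall(r"World", text)    # pattern occurs once
missing = re.findall(r"Python", text) # pattern does not occur
print(found)
print(missing)
```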
['World']
[]
The search method in Python's re module searches the string for a match. If search() finds the pattern, it returns a Match object for the first occurrence; otherwise it returns None.
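Because search() may return None, it is usually worth checking the result before using it. The path below is an assumed sample input.

```python
import re

path = r"C:\Users\anonymous\Desktop\my_folder"  # assumed sample path
match = re.search(r"anonymous", path)

# Guard against None before reading the matched text.
result = match.group() if match else None
print(result)
```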
anonymous
The split method splits the string at each match of the pattern and returns the resulting pieces as a list.
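Splitting a Windows-style path on backslashes might look like the sketch below. The path is an assumed sample input, and the pattern is written as a normal (non-raw) string so that it uses four backslashes.

```python
import re

path = r"C:\Users\anonymous\Desktop\my_folder"  # assumed sample path
# "\\\\" reaches the regex engine as \\, which matches one literal backslash.
# The raw string r"\\" would be an equivalent way to write the pattern.
parts = re.split("\\\\", path)
print(parts)
```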
['C:', 'Users', 'anonymous', 'Desktop', 'my_folder']
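Let us break down the most confusing statement of the example. The pattern passes through two layers of escaping: Python's string parser first, then the regex engine.
\\\\ (Becomes) ---> \\ (Python reads the four typed backslashes as two)
\\ (Becomes) ---> \ (the regex engine reads the two backslashes as one literal backslash)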
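The two steps can be checked directly. This is a small sketch with hypothetical variable names:

```python
import re

normal = "\\\\"  # four typed backslashes in a normal string
raw = r"\\"      # two typed backslashes in a raw string

# Step 1: after string parsing, both hold exactly two backslash characters.
print(len(normal), len(raw), normal == raw)

# Step 2: the regex engine treats those two characters as one literal backslash.
result = re.findall(normal, "a\\b")  # "a\\b" is the 3-character text a\b
print(len(result))
```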
The sub method stands for substitution. Hence, as the name suggests, this method will replace the given pattern with a specified replacement.
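A minimal substitution sketch, assuming a sample path and a replacement name chosen to give the result below:

```python
import re

path = r"C:\Users\anonymous\Desktop\my_folder"  # assumed sample path
# Replace every occurrence of the pattern with the new text.
result = re.sub(r"anonymous", "no_name", path)
print(result)
```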
C:\Users\no_name\Desktop\my_folder
In Python's RegEx, a match object contains information about the search and the result. If there is no match, then None is returned.
The match object has three methods/properties:
The span() method returns a tuple with the start and end positions of the match. Slicing the original string with these positions gives the matched text.
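The sketch below, on an assumed sample path, prints the span and then uses it to slice out the matched text:

```python
import re

path = r"C:\Users\anonymous\Desktop\my_folder"  # assumed sample path
match = re.search(r"anonymous", path)

start, end = match.span()  # end is exclusive, like a slice bound
print(match.span())
print(path[start:end])
```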
anonymous
The string property will return the string passed into the method.
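Reading the string property back from a Match object, with an assumed sample path:

```python
import re

path = r"C:\Users\anonymous\Desktop\my_folder"  # assumed sample path
match = re.search(r"anonymous", path)

# .string is an attribute, not a method, so there are no parentheses.
result = match.string
print(result)
```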
C:\Users\anonymous\Desktop\my_folder
The group() method will return the matching string.
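And the matched text itself, on the same assumed path:

```python
import re

path = r"C:\Users\anonymous\Desktop\my_folder"  # assumed sample path
match = re.search(r"anonymous", path)

# group() with no arguments returns the whole match.
result = match.group()
print(result)
```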
anonymous
String processing is an essential element of almost all applications. Hence, having powerful string processing tools such as Regular Expressions (RegEx) eases the development process. This tutorial covered various rules, syntax, and special characters for finding and extracting strings based on a pattern.