RegEx | learnpython.tech

RegEx, short for Regular Expression, is a string that describes a pattern in text. We can use RegEx for a variety of purposes, such as:

finding a substring within a string
checking for the absence of a string
extracting strings matching a specific pattern
finding the location of the substring
validating the presence of a pattern
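
As a rough illustration of the purposes listed above, here is a minimal sketch using Python's re module. The sample text and variable names are made up for this example.

```python
import re

text = "Order 66 shipped to Alderaan on 2024-05-04"  # made-up sample text

# Find a substring and its location (start, end)
m = re.search(r"shipped", text)
found_at = m.span() if m else None

# Check for the absence of a string
no_refund = re.search(r"refund", text) is None

# Extract every run of digits
numbers = re.findall(r"\d+", text)

# Validate the presence of a date-like pattern
has_date = re.search(r"\d{4}-\d{2}-\d{2}", text) is not None

print(found_at, no_refund, numbers, has_date)
```

Each of these calls is explained in the sections that follow.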

Regular Expression(RegEx) in Python

Python provides the module re to work with Regular Expressions. To import the re module, use the following statement:

import re

Special Sequences

Sequence	Description
\A	Matches only at the start of the string.
\b	Matches at a word boundary, that is, at the beginning or the end of a word.
\B	Matches only where there is no word boundary, that is, not at the beginning or end of a word.
\d	Matches any digit (0-9).
\D	Matches any character that is not a digit.
\s	Matches any whitespace character (space, tab, newline, and so on).
\S	Matches any non-whitespace character.
\w	Matches any word character: a-z, A-Z, 0-9, and the underscore (_).
\W	Matches any character that is not a word character.
\Z	Matches only at the end of the string.

Examples

\A: Matches only at the start of the string, so the pattern that follows must appear at the very beginning.

import re
haystack = r"Jingle Bell! Jingle Bell!"
needle = r"\AJingle"
x = re.findall(needle, haystack)
print(x)

['Jingle']

\b: Matches at a word boundary, that is, at the beginning or the end of a word.

import re
haystack = r"Jingle Bell! Jingle Bell! Jingle"
needle = r"\bJingle"
x = re.findall(needle, haystack)
print(x)

['Jingle', 'Jingle', 'Jingle']

\B: Matches only where there is no word boundary. Here, 'impossible' must be present, but not at the beginning of a word.

import re
haystack = r"Neversayimpossible. Because it is..."
needle = r"\Bimpossible"
x = re.findall(needle, haystack)
print(x)

['impossible']
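
To make the difference between \b and \B easier to see, here is a small sketch that runs both against the same input. The haystack is a made-up sample.

```python
import re

haystack = "cat concat catalog scat"  # made-up sample

starts_word = re.findall(r"\bcat", haystack)    # 'cat' at the start of a word
ends_word = re.findall(r"cat\b", haystack)      # 'cat' at the end of a word
whole_word = re.findall(r"\bcat\b", haystack)   # 'cat' as a complete word
inside_word = re.findall(r"\Bcat", haystack)    # 'cat' NOT at the start of a word

print(len(starts_word), len(ends_word), len(whole_word), len(inside_word))
```

Placing \b on both sides of a pattern is a common way to match whole words only.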

\d: Matches any digit (0-9).

import re
haystack = r"I have 2 credit cards."
needle = r"\d"
x = re.findall(needle, haystack)
print(x)

['2']

\D: Matches any character that is not a digit.

import re
haystack = r"I have 2 credit cards."
needle = r"\D"
x = re.findall(needle, haystack)
x = "".join(x)
print(x)

I have  credit cards.

\s: Matches any whitespace character.

import re
haystack = r"I have 2 credit cards."
needle = r"\s"
x = re.findall(needle, haystack)
print(x)

[' ', ' ', ' ', ' ']

\S: Matches any non-whitespace character.

import re
haystack = r"I have 2 credit cards."
needle = r"\S"
x = re.findall(needle, haystack)
x = "".join(x)
print(x)

Ihave2creditcards.

\w: Matches any word character: a-z, A-Z, 0-9, and the underscore (_).

import re
haystack = r"I_have_2_credit_cards!!!"
needle = r"\w"
x = re.findall(needle, haystack)
x = "".join(x)
print(x)

I_have_2_credit_cards

\W: Matches any character that is not a word character, that is, anything other than a-z, A-Z, 0-9, and the underscore (_).

import re
haystack = r"I_have_2_credit_cards!!!"
needle = r"\W"
x = re.findall(needle, haystack)
x = "".join(x)
print(x)

!!!

\Z: Matches only at the end of the string, so the pattern that precedes it must appear at the very end.

import re
haystack = r"Jingle Bell"
needle = r"Bell\Z"
x = re.findall(needle, haystack)
print(x)

['Bell']

Sets

Set	Description
[chars]	Matches any one of the characters listed inside the [].
[char_range_begin-char_range_end]	Matches any one character within the range given inside the [].
[^chars]	Matches any one character except those listed.
[digits]	Matches any one of the digits listed.
[digit_begin-digit_end]	Matches any one digit within the range given.
[digit_begin-digit_end][digit_begin-digit_end]	Matches two consecutive digits, each within its own range.
[char_begin-char_endchar_begin-char_end]	Matches any one character that falls within either of the ranges given.

[chars]: Matches any one of the characters listed inside the [].

import re
haystack = r"ABCDEFGHIJKL"
needle = r"[abc]"
x = re.findall(needle, haystack)
print(x)

[]

The above example returns an empty list because matching is case-sensitive and the lowercase characters a, b, and c are not present.

import re
haystack = r"ABCDEFGHIJKL"
needle = r"[abcJKL]"
x = re.findall(needle, haystack)
print(x)

['J', 'K', 'L']

The above example, however, returns the characters J, K, and L.

[char_range_begin-char_range_end]: Matches any one character within the range given inside the []. The example below returns an empty list because the string contains no character in the range 0-9.

import re
haystack = r"ABCDEFGHIJKL"
needle = r"[0-9]"
x = re.findall(needle, haystack)
print(x)

[]

[^chars]: Matches any one character except those listed.

import re
haystack = r"ABCDEFGHIJKL"
needle = r"[^ABC]"
x = re.findall(needle, haystack)
print(x)

['D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']

[digits]: Matches any one of the digits listed.

import re
haystack = r"1234567890"
needle = r"[02468]"

# Return all even digits
x = re.findall(needle, haystack)
print(x)

['2', '4', '6', '8', '0']

[digit_begin-digit_end]: Matches any one digit within the range given.

import re
haystack = r"1234567890"
needle = r"[0-4]"
x = re.findall(needle, haystack)
print(x)

['1', '2', '3', '4', '0']

[digit_begin-digit_end][digit_begin-digit_end]: Matches two consecutive digits, each within its own range.

import re
haystack = r"207821884220230924"
needle = r"[2][0-4]"
x = re.findall(needle, haystack)
print(x)

['20', '21', '22', '23', '24']

You can chain any number of sets one after another in this way.
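
As a sketch of chaining more than two sets, the pattern below strings together seven digit sets and a literal hyphen. The haystack and the phone-number-like format are made up for illustration.

```python
import re

haystack = "Call 555-0199 or 555-0123 before 10am"  # made-up sample
# Three digits, a hyphen, then four digits; each [0-9] matches exactly one digit
needle = r"[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
x = re.findall(needle, haystack)
print(x)
```

The "10" in "10am" is not returned, because it is not followed by the rest of the chained pattern.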

[char_begin-char_endchar_begin-char_end]: Matches any one character that falls within either of the ranges given.

import re
haystack = r"Ahdbdh909kd0!kjd9"
needle = r"[a-zA-Z]"
x = re.findall(needle, haystack)
print(x)

['A', 'h', 'd', 'b', 'd', 'h', 'k', 'd', 'k', 'j', 'd']

Meta Characters

Meta Character	Description
[]	Square brackets describe a set of characters.
\	The backslash introduces a special sequence. It is also used to escape special characters, including the backslash itself.
.	The dot (.) matches any character except the newline character.
^	The caret (^) specifies that the string should start with the given pattern.
$	The dollar ($) specifies that the string should end with the given pattern.
*	The asterisk (*) matches zero or more occurrences of the preceding element.
+	The plus (+) matches one or more occurrences of the preceding element.
{}	Curly braces specify the exact number of occurrences of the preceding element.
|	The pipe (|) matches either the pattern on its left or the one on its right.
()	Parentheses are used for grouping.

[]: Square brackets describe a set of characters.

import re
haystack = "Hello, World is my favourite sentence"
needle = "[a-d]"
x = re.findall(needle, haystack)
print(x)

['d', 'a', 'c']

In this example, we have specified to find all the characters within the specified set: a-d(lowercase).

\: The backslash introduces a special sequence. It is also used to escape special characters, including the backslash itself. Escaping the backslash can be confusing for beginners, so it is covered in detail in the split() section.

import re
haystack = "Hello, World is 1 of my favourite sentence"
needle = r"\d"
x = re.findall(needle, haystack)
print(x)

['1']

The above example finds every individual digit, not whole numbers, as the example below demonstrates.

import re
haystack = "Hello, World is 21 of my favourite sentence"
needle = r"\d"
x = re.findall(needle, haystack)
print(x)

['2', '1']

.: The dot (.) matches any character except the newline character.

import re
haystack = r"C:\Users\anonymous\Desktop\my_folder"
needle = "anon...us"
x = re.search(needle, haystack)
if x:
    print(x.group())

anonymous

This will search for a word starting with 'anon', and then having three non-newline characters, and ending with the characters 'us'.

^: The caret (^) specifies that the string should start with the given pattern.

import re
haystack = r"C:\Users\anonymous\Desktop\my_folder"
needle = r"^C:\\Users"
x = re.search(needle, haystack)
if x:
    print(x.group())

C:\Users

$: The dollar ($) specifies that the string should end with the given pattern.

import re
haystack = r"C:\Users\anonymous\Desktop\my_folder"
needle = r"my_folder$"
x = re.search(needle, haystack)
if x:
    print(x.group())

my_folder

# The string does not end with "music", so there is no match
import re
haystack = r"C:\Users\anonymous\Desktop\my_folder"
needle = r"music$"
x = re.search(needle, haystack)
if x:
    print(x.group())

The above example produces no output.
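
One detail worth knowing is that $ and \Z are not quite interchangeable: by default, $ also matches just before a newline at the very end of the string, while \Z matches only at the true end. A minimal sketch, with a made-up haystack:

```python
import re

haystack = "Jingle Bell\n"  # note the trailing newline

dollar = re.findall(r"Bell$", haystack)    # $ tolerates one trailing newline
strict = re.findall(r"Bell\Z", haystack)   # \Z requires the true end of the string

print(dollar)
print(strict)
```

This matters when validating lines read from a file, which often still carry their trailing newline.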

*: The asterisk (*) matches zero or more occurrences of the preceding element.

import re
haystack = r"Hello!. Welcome to Hell!"
needle = "ello*"
x = re.findall(needle, haystack)
print(x)

['ello', 'ell']

Here, the string must have the substring "ell" with zero or more 'o' characters.
Both ello and ell fit the pattern specified.

+: The plus (+) matches one or more occurrences of the preceding element.

import re
haystack = r"Hello!. Hello, World is my favourite sentence."
needle = "preferred+"
x = re.findall(needle, haystack)
print(x)

[]
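
The example above returns an empty list simply because the word "preferred" never occurs in the haystack. To see + itself at work, here is a small sketch that compares it with * on a made-up haystack:

```python
import re

haystack = "Helo! Hello! Hellllo! Heo!"  # made-up sample

plus = re.findall(r"Hel+o", haystack)   # one or more 'l' characters
star = re.findall(r"Hel*o", haystack)   # zero or more 'l' characters

print(plus)
print(star)
```

Only the * version accepts "Heo", since it allows the 'l' to be missing altogether.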

{}: Curly braces specify the exact number of occurrences of the preceding element.

import re
haystack = r"Hello!. Hello, World is my favourite sentence."
needle = "ello{1}"
x = re.findall(needle, haystack)
print(x)

['ello', 'ello']

This example specifies that "ell" must be followed by exactly 1 'o' character. In the above string, the word Hello ends with ello, and it occurs twice.
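
Curly braces also accept a range. The sketch below, on a made-up haystack, shows the {m}, {m,} and {m,n} forms; \b is added on both sides so that each run of letters is treated as a whole word.

```python
import re

haystack = "o oo ooo oooo"  # made-up sample

exactly_two = re.findall(r"\bo{2}\b", haystack)     # exactly 2
two_or_more = re.findall(r"\bo{2,}\b", haystack)    # 2 or more
two_to_three = re.findall(r"\bo{2,3}\b", haystack)  # between 2 and 3

print(exactly_two)
print(two_or_more)
print(two_to_three)
```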

|: The pipe (|) matches either the pattern on its left or the one on its right.

import re
haystack = r"Hello!. Hello, World is my favourite sentence."
needle = "favourite|preferred"
x = re.findall(needle, haystack)
print(x)

haystack = r"Hello!. Hello, World is my preferred sentence."
needle = "absent|preferred"
x = re.findall(needle, haystack)
print(x)

['favourite']
['preferred']

(): Parentheses are used for grouping. A group captures the part of the match that falls inside the parentheses.

import re
haystack = r"My favourite car is Honda-CRV8."
needle = r"Honda-CRV(\d)"
x = re.search(needle, haystack)
if x:
    print(x.group())

Honda-CRV8
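
Calling group() without arguments returns the whole match, so the example above does not show what the parentheses captured. Here is a sketch that reads the captured groups individually; the second pair of parentheses around "Honda" is added purely for illustration.

```python
import re

haystack = "My favourite car is Honda-CRV8."
m = re.search(r"(Honda)-CRV(\d)", haystack)

if m:
    whole = m.group()      # the entire match
    brand = m.group(1)     # text captured by the first group
    version = m.group(2)   # text captured by the second group
    both = m.groups()      # all captured groups as a tuple
    print(whole, brand, version, both)
```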

RegEx Methods

findall()

The findall method in the re module finds all non-overlapping matches and returns them as a list.

import re
haystack = "Hello, World is my favourite sentence."
x = re.findall("World", haystack)
print(x)

['World']

If the findall method cannot find a match, it returns an empty list. Below is an example:

import re
haystack = "Hello, World is my favourite sentence."
x = re.findall("primary", haystack)
print(x)

[]
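
When the pattern contains groups, findall returns the captured groups rather than the whole match. A short sketch with a made-up haystack:

```python
import re

haystack = "width=80, height=24"  # made-up sample

# Two groups: each result is a tuple of (name, value)
pairs = re.findall(r"(\w+)=(\d+)", haystack)

# One group: each result is just the captured value
values = re.findall(r"\w+=(\d+)", haystack)

print(pairs)
print(values)
```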

search()

The search method in the re module scans the string for the first location where the pattern matches. If it finds one, it returns a Match object; otherwise, it returns None.

import re
haystack = r"C:\Users\anonymous\Desktop\my_folder"
needle = "anonymous"
x = re.search(needle, haystack)
if x:
    s, e = x.span()
    print(haystack[s:e])

anonymous
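
Because search() returns None when nothing matches, the result should be checked before it is used. One possible way to wrap that check is sketched below; the helper name locate is hypothetical and not part of the re module.

```python
import re

haystack = r"C:\Users\anonymous\Desktop\my_folder"

def locate(needle, text):
    """Hypothetical helper: start index of the first match, or -1 if none."""
    m = re.search(needle, text)
    return m.start() if m else -1

first = locate("Desktop", haystack)
missing = locate("Documents", haystack)
print(first, missing)
```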

split()

The split method splits the string at every match of the pattern and returns the pieces as a list.

import re
haystack = r"C:\Users\anonymous\Desktop\my_folder"
needle = "\\\\"
x = re.split(needle, haystack)
print(x)

['C:', 'Users', 'anonymous', 'Desktop', 'my_folder']

Let us break down the most confusing statement of the example:

needle = "\\\\"

If you have read about strings, you will know that Python has escape characters.

Now, I'll explain how '\\\\' ends up matching a single '\'. To write one backslash ('\') in a normal Python string, we need to write it as '\\'. However, we are passing the string to the regex engine, which has its own escaping rules: it cannot interpret a lone '\' and needs '\\' to mean one literal backslash. Therefore, Python first turns '\\\\' into '\\', and the regex engine then reads '\\' as a single literal '\'.

\\\\  (Becomes) ---> \\

\\    (Becomes) ---> \
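
A raw string skips the first of these two steps, which usually makes patterns easier to read. The sketch below checks that the two spellings are the same pattern; the variable names are illustrative only.

```python
import re

plain = "\\\\"   # normal string: four typed backslashes become two characters
raw = r"\\"      # raw string: two typed backslashes stay two characters

same_pattern = (plain == raw)
pattern_length = len(raw)

haystack = r"C:\Users\anonymous"
parts = re.split(raw, haystack)

# re.escape() builds the escaped form of a literal backslash for us
escaped = re.escape("\\")

print(same_pattern, pattern_length, parts, escaped == raw)
```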

sub()

The sub method stands for substitution. Hence, as the name suggests, this method will replace the given pattern with a specified replacement.

import re
haystack = r"C:\Users\anonymous\Desktop\my_folder"
needle = "anonymous"
x = re.sub(needle, "no_name", haystack)
print(x)

C:\Users\no_name\Desktop\my_folder
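
The replacement can also refer back to captured groups, and the count argument limits how many replacements are made. A minimal sketch with a made-up haystack:

```python
import re

haystack = "2024-05-04 and 2023-12-25"  # made-up sample

# \1, \2 and \3 in the replacement refer to the captured year, month and day
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", haystack)

# count=1 replaces only the first match
first_only = re.sub(r"\d{4}", "YYYY", haystack, count=1)

print(swapped)
print(first_only)
```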

Match Object

In Python's RegEx, a match object contains information about the search and the result. If there is no match, then None is returned.

The match object has three commonly used methods/properties:

span()

The span() method will return the start and end position of the matching string.

import re
haystack = r"C:\Users\anonymous\Desktop\my_folder"
needle = "anonymous"
x = re.search(needle, haystack)
if x:
    start, end = x.span()
    print(haystack[start:end])

anonymous

string

The string property will return the string passed into the method.

import re
haystack = r"C:\Users\anonymous\Desktop\my_folder"
needle = "anonymous"
x = re.search(needle, haystack)
if x:
    print(x.string)

C:\Users\anonymous\Desktop\my_folder

group()

The group() method will return the matching string.

import re
haystack = r"C:\Users\anonymous\Desktop\my_folder"
needle = "anonymous"
x = re.search(needle, haystack)
if x:
    print(x.group())

anonymous
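
To see the pieces of a match object side by side, the sketch below collects them into a dictionary. It also uses start() and end(), which return the two halves of span() separately; the summary dictionary is just an illustrative container.

```python
import re

haystack = r"C:\Users\anonymous\Desktop\my_folder"
m = re.search("anonymous", haystack)

if m:
    summary = {
        "text": m.group(),                  # the matching string
        "span": m.span(),                   # (start, end) positions
        "start": m.start(),                 # same as span()[0]
        "end": m.end(),                     # same as span()[1]
        "same_input": m.string == haystack  # the string that was searched
    }
    print(summary)
```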

Conclusion

String processing is an essential element of almost all applications. Hence, having powerful string processing tools such as Regular Expressions (RegEx) eases the development process. This tutorial covered various rules, syntax, and special characters for finding and extracting strings based on a pattern.
