
Python Regular Expression Syntax Explained Simply
In this blog, you will learn about regular expressions (RegEx) and use Python's re module to work with them, with the help of examples.
A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. For example,
^a...s$
The code above defines a RegEx pattern. The pattern matches any five-character string that starts with a and ends with s.
Python has a module named re to work with RegEx. Here's an example:
import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")
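To make the "starts with a, ends with s, five characters" rule concrete, here is a small sketch that runs the same pattern against several strings. The test strings are hypothetical and were chosen only for illustration.

```python
import re

pattern = r'^a...s$'

# Hypothetical test strings chosen for illustration.
candidates = ['abyss', 'alias', 'abs', 'Alias', 'an abacus']

# Map each string to True if the pattern matches it, False otherwise.
results = {s: bool(re.match(pattern, s)) for s in candidates}

for s, ok in results.items():
    print(f"{s!r}: {'match' if ok else 'no match'}")
```

Only 'abyss' and 'alias' match. 'abs' is too short, 'Alias' starts with an uppercase A, and 'an abacus' has more than five characters.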
Commonly used functions in the re module
re.findall()
The re.findall() method returns a list of strings containing all matches.
Example
Program to extract numbers from a string
import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'

print("Entered String=", string)
result = re.findall(pattern, string)
print("The numbers in the above string", result)
Output
Entered String= hello 12 hi 89. Howdy 34
The numbers in the above string ['12', '89', '34']
If the pattern is not found, re.findall() returns an empty list.
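The empty-list case can be sketched as follows. The input string is a hypothetical one that contains no digits.

```python
import re

no_digits = 'hello world'  # hypothetical input with no digits
result = re.findall(r'\d+', no_digits)
print(result)

# An empty list is falsy, so it can be tested directly.
if not result:
    print("No numbers found.")
```

Because the result is always a list, the calling code does not need a separate check for None.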
re.search()
The re.search() method takes two arguments: a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string.
If the search is successful, re.search() returns a match object; if not, it returns None.
Syntax
match = re.search(pattern, str)
Example
import re

string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")
Output
pattern found inside the string
Here, match contains a match object.
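As a rough illustration of both return values, the sketch below searches the same string twice: once for a word that appears later in the string, and once with \A, which only allows a match at the very start. The variable names are assumptions made for this example.

```python
import re

string = "Python is fun"

# re.search() scans the whole string, so 'fun' is found at the end.
anywhere = re.search(r'fun', string)

# \A anchors the pattern to the start of the string, so this search fails.
anchored = re.search(r'\Afun', string)

print(anywhere)
print(anchored)
```

The first call returns a match object and the second returns None, which is why the result is usually checked with an if statement before it is used.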
re.subn()
re.subn() is similar to re.sub(), except that it returns a tuple of two items: the new string and the number of substitutions made.
Example
# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'
print("Original String =", string)

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string)
print("New String=", new_string)
Output
Original String = abc 12de 23 
 f45 6
New String= ('abc12de23f456', 4)
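Since re.subn() returns a two-item tuple, it is often convenient to unpack it directly. A minimal sketch, using a hypothetical input string:

```python
import re

text = 'abc 12 de 23'  # hypothetical input

# Unpack the tuple into the new string and the substitution count.
new_text, n_subs = re.subn(r'\s+', '', text)

print(new_text)
print(n_subs)
```

Here new_text is 'abc12de23' and n_subs is 3, one for each run of whitespace that was removed.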
re.split()
The re.split() method splits the string wherever the pattern matches and returns a list of the resulting substrings.
Example
import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string)
print(result)
Output
['Twelve:', ' Eighty nine:', '.']
If the pattern is not found, re.split() returns a list containing the original string.
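A quick sketch of the no-match case, using a hypothetical string that contains no digits:

```python
import re

text = 'no digits here'  # hypothetical input
parts = re.split(r'\d+', text)

# With no match, the list holds the original string as its only item.
print(parts)
```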
You can pass the maxsplit argument to the re.split() method. It is the maximum number of splits that will occur.
Example
import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit = 1
# split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)
Output
['Twelve:', ' Eighty nine:89 Nine:9.']
The default value of maxsplit is 0, which means that all possible splits are made.
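To compare the effect of different maxsplit values side by side, the sketch below splits the same string with limits of 0, 1 and 2. The name limit is only a loop variable chosen for this example.

```python
import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# Collect the result for each limit so the lists can be compared.
splits = {limit: re.split(pattern, string, maxsplit=limit) for limit in (0, 1, 2)}

for limit, parts in splits.items():
    print(limit, parts)
```

With maxsplit=0 the string is split at every number, giving four items. With maxsplit=1 and maxsplit=2, the remainder of the string after the last allowed split is kept as the final item.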
re.sub()
The syntax of re.sub() is:
re.sub(pattern, replace, string)
The method returns a string in which every matched occurrence is replaced with the content of the replace argument.
Example
# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string)
print(new_string)
Output
abc12de23f456
If the pattern is not found, re.sub() returns the original string.
You can pass count as a fourth parameter to the re.sub() method. If omitted, it defaults to 0, which replaces all occurrences.
Example
import re

# multiline string
string = "abc 12\
de 23 \n f45 6"

# matches all whitespace characters
pattern = r'\s+'
replace = ''

# replace only the first match
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)
Output
abc12de 23 
 f45 6
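The sketch below shows how the result changes as count grows. The input string and the '-' replacement are hypothetical and chosen so that each substitution is easy to see.

```python
import re

text = 'a b c d'  # hypothetical input

# count=0 (the default) replaces every match; other values cap the number.
replace_all = re.sub(r'\s+', '-', text, count=0)
replace_one = re.sub(r'\s+', '-', text, count=1)
replace_two = re.sub(r'\s+', '-', text, count=2)

print(replace_all)
print(replace_one)
print(replace_two)
```

Matches are replaced from left to right, so count=1 changes only the first space and count=2 changes the first two.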
Match object
You can list the methods and attributes of a match object using the dir() function.
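As a small sketch, the code below calls dir() on a match object and keeps only the public names. The filtering step and the name public_names are additions made for readability in this example.

```python
import re

match = re.search(r'(\d{3}) (\d{2})', '39801 356, 2102 1111')

# Keep only the public names (those without a leading underscore).
public_names = [name for name in dir(match) if not name.startswith('_')]
print(public_names)
```

The printed list includes names such as group, groups, start, end, span, re and string.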
Some of the commonly used methods and attributes of match objects are:
match.group()
The group() method returns the part of the string where there is a match.
Example
import re

string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string)

if match:
    print(match.group())
else:
    print("pattern not found")
Output
801 35
Here, the match variable contains a match object.
Our pattern (\d{3}) (\d{2}) has two subgroups, (\d{3}) and (\d{2}). You can get the part of the string matched by each of these parenthesized subgroups. Here's how:
>>> match.group(1)
'801'
>>> match.group(2)
'35'
>>> match.group(1, 2)
('801', '35')
>>> match.groups()
('801', '35')
match.start(), match.end() and match.span()
The start() function returns the index of the start of the matched substring. Similarly, end() returns the end index of the matched substring.
>>> match.start()
2
>>> match.end()
8
The span() function returns a tuple containing the start and end index of the matched part.
>>> match.span()
(2, 8)
match.re and match.string
The re attribute of a match object returns the regular expression object that produced the match. Similarly, the string attribute returns the string that was passed in.
>>> match.re
re.compile('(\\d{3}) (\\d{2})')
>>> match.string
'39801 356, 2102 1111'
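To connect the indices returned by start(), end() and span() with the matched text, the sketch below slices the original string with them. It also passes a group number to start() and end(), which gives the position of a single subgroup.

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

# Slicing with the span reproduces the text returned by match.group().
start, end = match.span()
whole = string[start:end]
print(whole)

# start() and end() accept a group number for subgroup positions.
first = (match.start(1), match.end(1))
second = (match.start(2), match.end(2))
print(first, second)
```

The end index is exclusive, as in ordinary Python slicing, so string[2:8] covers the characters at positions 2 through 7.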
We have covered the most commonly used functions and methods of the re module. If you want to learn more, see the Python 3 re module documentation.
Using r prefix before RegEx
When the r or R prefix is placed before a string literal that holds a regular expression, the literal is a raw string. For example, '\n' is a newline, whereas r'\n' is two characters: a backslash \ followed by n.
The backslash \ is used to escape various characters, including all metacharacters. However, the r prefix makes Python treat \ as a normal character, so the backslash reaches the RegEx engine unchanged.
Example
import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string)
print(result)
Output
['\n', '\r']
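The difference between a normal and a raw string literal can also be checked by comparing lengths. This is a minimal sketch, and the sample text passed to re.findall() is hypothetical.

```python
import re

normal = '\n'  # one character: a newline
raw = r'\n'    # two characters: a backslash and the letter n

print(len(normal), len(raw))

# With a raw string, the RegEx engine receives \d unchanged.
digits = re.findall(r'\d+', 'room 101, floor 7')
print(digits)
```

Writing patterns as raw strings avoids having to double every backslash that is meant for the RegEx engine.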
Conclusion
These are the most fundamental regular expression concepts in Python, explained through examples. Some of the examples were made up, but most came from real problems we encountered while cleaning up our data. If you run into a similar problem in the future, review the examples again; you may find the solution there.