
Match Pattern Over Multiple Lines in Python
When working with Python's regular expressions (regex), you may need to match text that spans multiple lines. This can happen when you read information from a file or scrape data from a website.
This chapter will show you how to match patterns across several lines using Python's re module, which allows you to work with regular expressions (regex).
What is a Regular Expression?
A regular expression is a sequence of characters that defines a search pattern for finding a string or a set of strings. It is also called a regex. Here is a brief overview of some common regular expression symbols -
- .: It is used to match any character except a newline.
- *: It is used to match zero or more occurrences of the preceding element.
- +: It is used to match one or more occurrences of the preceding element.
- ?: It is used to match zero or one occurrence of the preceding element.
- []: It is used to match any single character from the set.
- (): It is used to group patterns for applying quantifiers or for capturing.
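The symbols listed above can be tried out with a short sketch like the one below. The sample strings and variable names are made up purely for illustration.

```python
import re

# Hypothetical sample strings, chosen only to illustrate each symbol.
dot = re.findall(r"c.t", "cat cot c\nt")          # '.' does not match the newline in "c\nt"
star = re.findall(r"ab*", "a ab abbb")            # '*' allows zero or more 'b'
plus = re.findall(r"ab+", "a ab abbb")            # '+' requires at least one 'b'
optional = re.findall(r"colou?r", "color colour") # '?' makes the 'u' optional
char_set = re.findall(r"[bc]at", "bat cat rat")   # '[]' accepts 'b' or 'c' only
grouped = re.search(r"(ab)+c", "xababc").group(0) # '()' lets '+' repeat "ab" as a unit

print(dot)       # ['cat', 'cot']
print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['color', 'colour']
print(char_set)  # ['bat', 'cat']
print(grouped)   # ababc
```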
Handling Multi-line Strings
Generally, in a regular expression search, the '.' special character does not match newline characters. To handle this, the re module provides a few predefined flags that modify how special characters behave. With the re.DOTALL flag, '.' also matches newlines, so a pattern can match text that spans multiple lines.
Example
In the following example, we match an entire paragraph element by using a regular expression. We begin by importing the re module.
Then, we call the re.search() function, which searches the string for a match and returns a match object if there is one, or None otherwise. The group() method returns the part of the string that was matched.
import re

paragraph = \
'''
<p>
Tutorials point is a website.
It is a platform to enhance your skills.
</p>
'''

match = re.search(r'<p>.*</p>', paragraph, re.DOTALL)
print(match.group(0))
The following output is obtained on executing the above program -
<p>
Tutorials point is a website.
It is a platform to enhance your skills.
</p>
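To see why the flag matters in the example above, it may help to run the same search with and without re.DOTALL. The following is a minimal sketch. The sample string and variable names are illustrative and not part of the original example.

```python
import re

# Illustrative multi-line string; the content is an assumption for this sketch.
paragraph = "<p>\nfirst line\nsecond line\n</p>"

# Without re.DOTALL, '.' stops at the first newline, so no match is found.
without_flag = re.search(r"<p>.*</p>", paragraph)

# With re.DOTALL, '.' also matches newlines, so the whole block is matched.
with_flag = re.search(r"<p>.*</p>", paragraph, re.DOTALL)

print(without_flag)                      # None
print(with_flag.group(0) == paragraph)   # True
```

Because re.search() returns None when nothing matches, it is usually safer to check the result before calling group() on it.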
More Complex Patterns
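You can build progressively more complex patterns as needed. For example, to find every line that starts with "It", is followed by any text, and ends with a period, you can use the following pattern together with the re.M flag (short for re.MULTILINE), which lets ^ match at the start of each line -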
import re

multi_line_text = """Hello world.
It is a beautiful day.
This is a test.
It will work!"""

pattern = r'(^It.*\.)'
matches = re.findall(pattern, multi_line_text, re.M)
print("Matches found:", matches)
The following output is obtained on executing the above program -
Matches found: ['It is a beautiful day.']
Using re.MULTILINE
The re.MULTILINE flag is useful when you want ^ and $ to match at the start and end of each line, rather than only at the start and end of the whole string.
import re

text = """Hello world.
It is a beautiful day.
It will work!"""

matches = re.findall(r"^It is.*", text, re.MULTILINE)
print(matches)
The following output is obtained on executing the above program -
['It is a beautiful day.']
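The example above only exercises ^. As a rough sketch of how $ behaves under the same flag, the snippet below collects the last word of each line. The pattern \S+$ and the variable names are choices made for this illustration.

```python
import re

text = """Hello world.
It is a beautiful day.
It will work!"""

# Without the flag, $ matches only at the end of the whole string.
default_ends = re.findall(r"\S+$", text)

# With re.MULTILINE, $ also matches just before each newline.
line_ends = re.findall(r"\S+$", text, re.MULTILINE)

print(default_ends)  # ['work!']
print(line_ends)     # ['world.', 'day.', 'work!']
```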
Match a multiline HTML block
You can also use the re.DOTALL flag to match a multiline HTML block, such as a div block that contains several paragraphs.
In the following example, the pattern <div>.*</div> matches all content, including newlines, between the opening and closing div tags.
import re

html = """<div>
<p>Hello</p>
<p>Welcome</p>
</div>"""

pattern = r"<div>.*</div>"
match = re.search(pattern, html, re.DOTALL)
print(match.group())
Following is the output of the above program -
<div>
<p>Hello</p>
<p>Welcome</p>
</div>
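One point worth keeping in mind is that .* is greedy, so with re.DOTALL it runs to the last closing tag in the string. The sketch below assumes an input with two separate div blocks (a made-up sample, not taken from the example above) and compares the greedy pattern with the non-greedy form .*?.

```python
import re

# Hypothetical input containing two separate div blocks.
html = """<div>
<p>Hello</p>
</div>
<div>
<p>Welcome</p>
</div>"""

# Greedy: '.*' extends to the final </div>, giving one match that spans both blocks.
greedy = re.findall(r"<div>.*</div>", html, re.DOTALL)

# Non-greedy: '.*?' stops at the first </div> it can, giving one match per block.
lazy = re.findall(r"<div>.*?</div>", html, re.DOTALL)

print(len(greedy))  # 1
print(len(lazy))    # 2
print(lazy[0])
```

This only works for simple, non-nested blocks. For nested or irregular markup, a dedicated HTML parser is likely a better fit than a regular expression.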