
Extract Title from a Webpage Using Python
In Python, we can extract the title of a webpage using web scraping, the process of extracting data from a website or webpage. In this article, we will scrape the title of a webpage in four ways: with Requests and BeautifulSoup, with urllib and BeautifulSoup, with Selenium and BeautifulSoup, and with regular expressions.
Extracting Title from Webpage
Method 1: Using the Requests and Beautiful Soup libraries
We can use the Requests and Beautiful Soup libraries to extract the title from a webpage. The requests library sends an HTTP request to a website and returns its response. We then take the HTML content of the webpage from the response object and parse it with Beautiful Soup.
Example
In the example below, we extract the title of the Wikipedia homepage. We send a GET request to the Wikipedia URL using the requests library and store the response object in the 'response' variable.
We then create a BeautifulSoup object to parse the HTML content of the response and get the title tag of the webpage through the 'soup.title' attribute. Finally, we read the tag's 'string' attribute and store it in the 'title' variable.
import requests
from bs4 import BeautifulSoup

url = 'https://www.wikipedia.org/'

# Send a GET request and keep the response object
response = requests.get(url)

# Parse the HTML and read the text of the <title> tag
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.title.string
print(title)
Output
Wikipedia
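The 'soup.title' attribute is None when a page has no title tag, so 'soup.title.string' raises an AttributeError on such pages. The sketch below shows one possible way to guard against that and to stop on HTTP errors. The helper names title_from_html and fetch_title, the 10-second timeout, and the choice to return None are assumptions made for illustration, not part of the original example.

```python
import requests
from bs4 import BeautifulSoup


def title_from_html(html):
    """Return the stripped <title> text, or None if it is missing or empty."""
    soup = BeautifulSoup(html, 'html.parser')
    if soup.title is None:
        return None
    text = soup.title.get_text(strip=True)
    return text or None


def fetch_title(url, timeout=10):
    """Download a page and return its title (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return title_from_html(response.content)
```

With these helpers, fetch_title(url) can be called with the same url as in the example above. Keeping the parsing step in its own function also makes it usable on HTML that was obtained some other way.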
Method 2: Extracting title using urllib and BeautifulSoup
This method opens the URL and retrieves the HTML content of the webpage with urllib, which is part of the Python standard library. A BeautifulSoup object is created from the HTML content, and the title tag of the webpage is read through the 'soup.title' attribute.
Example
In the example below, we use the urllib library to open the URL and retrieve the HTML content of the webpage. We then create a BeautifulSoup object from that content using the 'html.parser' parser.
We get the title tag through the 'soup.title' attribute, read its text through the 'string' attribute, store it in the 'title' variable, and print it to the console.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wikipedia.org/'

# Open the URL; BeautifulSoup reads the HTML from the response object
html_page = urlopen(url)
soup = BeautifulSoup(html_page, 'html.parser')

title = soup.title.string
print(title)
Output
Wikipedia
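By default, urlopen identifies itself with a generic Python-urllib User-Agent header, which some sites may reject, and the response object stays open until it is closed. The sketch below shows one way to handle both with urllib.request.Request and a with block. The helper name title_with_urllib, the header value, and the timeout are assumptions made for illustration.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def title_with_urllib(url, user_agent='title-example/0.1', timeout=10):
    """Open url with urllib and return its <title> text, or None (hypothetical helper)."""
    request = Request(url, headers={'User-Agent': user_agent})
    # The with block closes the response once the HTML has been parsed.
    with urlopen(request, timeout=timeout) as html_page:
        soup = BeautifulSoup(html_page, 'html.parser')
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True) or None
```

The function can be called with the same url as in the example above. The User-Agent string here is only a placeholder; a real script would normally use a value that identifies the tool and its author.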
Method 3: Extracting title using Selenium and BeautifulSoup
This method uses the Selenium library to open the URL in a real browser and retrieve the HTML content of the webpage. A Chrome webdriver is created and used to navigate to the webpage, and the HTML content is read from the 'page_source' attribute of the webdriver. A BeautifulSoup object is created from that content, and the title tag of the webpage is read through the 'soup.title' attribute.
Example
In the example below, we create a Chrome webdriver, use it to navigate to the webpage, and retrieve the HTML content through the 'page_source' attribute of the webdriver.
We then create a BeautifulSoup object from that content using the 'html.parser' parser and get the title tag through the 'soup.title' attribute. Finally, we read the tag's text through the 'string' attribute, store it in the 'title' variable, print it to the console, and close the browser.
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.wikipedia.org/'

# Start Chrome and load the page
driver = webdriver.Chrome()
driver.get(url)

# Parse the rendered HTML and read the <title> tag
html_page = driver.page_source
soup = BeautifulSoup(html_page, 'html.parser')
title = soup.title.string
print(title)

# Close the browser
driver.quit()
Output
Wikipedia
Method 4: Extracting title using regular expressions
This method sends a GET request to the URL using the requests library, decodes the HTML content of the response into a string, and matches the title tag with a regular expression instead of an HTML parser. The 'search' method of the compiled pattern finds the first match in the HTML content, and 'group(1)' returns the text captured between the opening and closing title tags.
Example
In the example below, we send a GET request to the URL using the requests library and store the response object in the 'response' variable.
We then decode the HTML content of the webpage using the 'utf-8' encoding and store it in the 'html_content' variable. We define a regular expression pattern that matches the title tag of the webpage and captures the text inside it.
We use the 'search' method of the pattern to find the first match in the HTML content, take the captured text with 'group(1)', store it in the 'title' variable, and print it to the console.
import re
import requests

url = 'https://www.wikipedia.org/'
response = requests.get(url)

# Decode the raw bytes of the response into text
html_content = response.content.decode('utf-8')

# Capture the text between <title> and </title>
title_pattern = re.compile('<title>(.+?)</title>')
match = title_pattern.search(html_content)
title = match.group(1)
print(title)
Output
Wikipedia
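A pattern that matches only a lowercase title tag on a single line misses variants such as an uppercase tag, a tag with attributes, or a title that spans several lines, and 'match.group(1)' fails when 'search' returns None. The sketch below is one hedged way to make the regular expression approach more tolerant. The names TITLE_RE and title_with_regex are assumptions made for illustration, and a regular expression remains more fragile than an HTML parser on unusual markup.

```python
import html
import re

# Match <title ...>text</title>, ignoring case and allowing newlines in the text.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>', re.IGNORECASE | re.DOTALL)


def title_with_regex(html_text):
    """Return the first <title> text in html_text, or None (hypothetical helper)."""
    match = TITLE_RE.search(html_text)
    if match is None:
        return None
    # Decode entities such as &amp; and collapse runs of whitespace.
    text = html.unescape(match.group(1))
    return ' '.join(text.split()) or None
```

The function takes an already decoded string, so it can be applied to the 'html_content' variable from the example above.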
Conclusion
In this article, we discussed four ways to extract the title from a webpage in Python. The requests and urllib libraries send an HTTP request to the website URL and return the HTML content as a response, while Selenium loads the page in a Chrome browser. The Beautiful Soup library then parses the HTML content and extracts the title tag; alternatively, a regular expression can match the title tag directly in the HTML text.