How to parse HTML in Ruby?

Last Updated : 01 Apr, 2024

Many programming languages provide libraries for parsing HTML. In Python, for example, HTML is commonly parsed with Beautiful Soup, a library used mainly for web scraping, and pandas can read HTML tables. In Ruby, the equivalent is Nokogiri, a library that makes parsing HTML straightforward.

To work with HTML in Ruby, we first need the Nokogiri gem. Run the following command to install it:

gem install nokogiri

The above command installs the library we will use to parse HTML.

Table of Contents

1. Extracting the tags from the HTML File
2. Extracting the tags from the URL
Conclusion:

1. Extracting the tags from the HTML File

In this program, we parse an HTML string using the Nokogiri library. We pass the string to the Nokogiri::HTML.parse method, which returns a document object. We then read the page title from that object by calling its title method.

Ruby

# Import the Nokogiri library
require 'nokogiri'
# The HTML text to parse
html_text = "<title>MyFirstWebSite</title>"
# Parse the HTML text into a Nokogiri document
html_doc = Nokogiri::HTML.parse(html_text)
# Print the title of the HTML document
puts html_doc.title

Output:

MyFirstWebSite

Program Explanation:

First, we import the Nokogiri library.
Then we create a string that contains HTML tags.
We pass this string to the Nokogiri::HTML.parse method, which returns a parsed document object.
Finally, we print the title of the HTML text by calling title on the parsed document.

2. Extracting the tags from the URL

In this program, we use the open-uri library to read the HTML from a URL and then parse it with Nokogiri. We then extract the title of the page at that URL.

Let's consider an example file:

https://newpage.com

<html> 
<head>
  <title> MyFirstWebSite</title>
</head>
<body>
<h1> Hi </h1>
</body>
</html>

Program:

Ruby

require 'nokogiri'
require 'open-uri'
# Read the HTML from the URL and parse it
html_doc = Nokogiri::HTML.parse(URI.open('https://newpage.com'))
# Print the title of the HTML page, without surrounding whitespace
puts html_doc.title.strip

Output:

MyFirstWebSite

Program Explanation:

First, we import the nokogiri and open-uri libraries.
Then we open the URL with the URI.open method from open-uri. Since Ruby 3.0, the plain open method no longer accepts URLs.
URI.open fetches the entire content available at the URL.
Finally, Nokogiri parses that content and we print the title of the HTML page.

Conclusion:

HTML is usually parsed for web scraping, which has become an important technique in data science and is often part of data wrangling. Ruby's libraries make reading data from HTML straightforward: with Nokogiri and open-uri, we can scrape the web and extract data from HTML strings, HTML files, and URLs.