
Extract URLs Present in a Given String
In the information age, it's common to encounter strings of text that contain URLs. As part of data cleaning or web scraping tasks, we often need to extract these URLs for further processing. In this article, we'll explore how to do this using C++, a high-performance language that offers fine-grained control over system resources.
Understanding URLs
A URL (Uniform Resource Locator) is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. In layman's terms, a URL is a web address.
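To make the parts of a web address concrete, the sketch below splits a simple URL into its scheme, host, and path. The `UrlParts` struct, the `splitUrl` function, and the pattern itself are illustrative assumptions, not part of the original solution. The pattern is deliberately simplified and ignores ports, query strings, and fragments.

```cpp
#include <regex>
#include <string>

// Hypothetical helper type for this sketch: the three parts we pull out.
struct UrlParts {
    std::string scheme;  // e.g. "https"
    std::string host;    // e.g. "www.example.com"
    std::string path;    // e.g. "/docs/index.html" (empty if absent)
};

// Splits a simple http/https URL into its parts.
// Returns false (and leaves `out` untouched) if the whole string is not
// a URL of the simplified form scheme://host[/path].
bool splitUrl(const std::string& url, UrlParts& out) {
    static const std::regex pattern(R"((https?)://([^/\s]+)(/\S*)?)");
    std::smatch m;
    if (!std::regex_match(url, m, pattern)) {
        return false;
    }
    out.scheme = m[1].str();
    out.host = m[2].str();
    out.path = m[3].str();  // an unmatched optional group yields ""
    return true;
}
```

Calling `splitUrl("https://www.example.com/docs/index.html", parts)` would fill `parts` with the scheme `https`, the host `www.example.com`, and the path `/docs/index.html`.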
Problem Statement
Given a string that contains several URLs, our task is to extract all the URLs present in the string.
Solution Approach
To solve this problem, we'll use the regular expression (regex) support in the C++ standard library, provided by the <regex> header since C++11. A regular expression is a sequence of characters that defines a search pattern, used mainly for matching and extracting parts of strings.
The steps involved in our approach are:
Define a Regex Pattern: Define a regex pattern that matches the general structure of a URL.
Match and Extract: Use the regex pattern to match and extract all URLs present in the given string.
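The two steps above can be tried on a single match before extracting every URL. The sketch below defines a pattern and uses `std::regex_search` to return the first match only. The function name `firstUrl` and the loose pattern (a scheme followed by any run of non-whitespace characters) are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// Returns the first http/https URL found in `text`,
// or an empty string if there is none.
// Assumption: a URL is "http://" or "https://" followed by
// any run of non-whitespace characters.
std::string firstUrl(const std::string& text) {
    // Step 1: define the pattern (raw string literal, so no double escaping).
    static const std::regex urlPattern(R"(https?://\S+)");

    // Step 2: search the text and extract the matched substring.
    std::smatch match;
    if (std::regex_search(text, match, urlPattern)) {
        return match.str();
    }
    return "";
}
```

`std::regex_search` stops at the first match, which is why collecting every URL needs an iterator such as `std::sregex_iterator` instead.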
C++ Implementation
Example
Here's the C++ code that implements our solution:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;

// Function to extract all URLs from a string
vector<string> extractURLs(const string& str) {
    vector<string> urls;
    // A raw string literal keeps the backslashes intact for the regex engine.
    // Pattern: required scheme, host, a dot, a top-level domain, optional path.
    // No space is allowed inside a match, so a URL ends at the next blank.
    regex urlPattern(R"(https?://[0-9a-z.-]+\.[a-z]{2,}(/[\w./-]*)?)", regex::icase);
    auto words_begin = sregex_iterator(str.begin(), str.end(), urlPattern);
    auto words_end = sregex_iterator();
    for (sregex_iterator i = words_begin; i != words_end; ++i) {
        urls.push_back(i->str());  // full text of the current match
    }
    return urls;
}

int main() {
    string str = "Visit https://www.tutorialspoint.com and http://www.tutorix.com for more information.";
    vector<string> urls = extractURLs(str);
    cout << "URLs found in the string:" << endl;
    for (const string& url : urls)
        cout << url << endl;
    return 0;
}
Output
URLs found in the string:
https://www.tutorialspoint.com
http://www.tutorix.com
Explanation
Let's consider the string:
str = "Visit https://www.tutorialspoint.com and http://www.tutorix.com for more information."
After applying our function to this string, it matches the two URLs and extracts them into a vector:
urls = ["https://www.tutorialspoint.com", "http://www.tutorix.com"]
This vector is the output of our program.
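In running text, a URL is often followed directly by a period, comma, or closing parenthesis that belongs to the sentence rather than to the address. The sketch below shows one possible way to handle this: match loosely, then trim trailing punctuation from each match. The function name `extractTrimmedUrls` and the set of trimmed characters are assumptions for illustration, not part of the solution above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Extracts http/https URLs from `text` and strips sentence punctuation
// that directly follows a URL.
// Assumption: none of the characters in `trailing` is a meaningful
// last character of a URL in the input.
std::vector<std::string> extractTrimmedUrls(const std::string& text) {
    static const std::regex urlPattern(R"(https?://\S+)");
    const std::string trailing = ".,;:!?)\"'";

    std::vector<std::string> urls;
    for (std::sregex_iterator it(text.begin(), text.end(), urlPattern), end;
         it != end; ++it) {
        std::string url = it->str();
        // Drop punctuation that belongs to the sentence, not the URL.
        while (!url.empty() && trailing.find(url.back()) != std::string::npos) {
            url.pop_back();
        }
        urls.push_back(url);
    }
    return urls;
}
```

The trade-off is that a URL which legitimately ends in one of the trimmed characters, such as a closing parenthesis, would be shortened as well, so the character set should be adjusted to the input at hand.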
Conclusion
Extracting URLs from a string is a compact exercise in text processing with regular expressions: define a pattern that describes a URL, then iterate over every match in the input. The same approach is useful in data cleaning, web scraping, and general software development.