Shell Script To Show All the Internal and External Links From a URL
Last Updated: 22 Feb, 2023
Web developers connect webpages together to build a hierarchy of information; this practice is called webpage linking. There are two types of links: internal and external. Internal links point to other pages on the same website, tying the site's pages together, while external links point to a page on another website or domain. External links play a vital role in how a website ranks on search engines: increasing the number of external links pointing to your website can improve its rank. Here we are asked to write a shell script that prints all of these links on the terminal. The only input given to the script is the URL of the webpage whose links we need to fetch.
Note: A website can be accessed in two ways: through a graphical web browser, or from the terminal using commands that support a limited set of protocols. Plain terminal commands have some limitations, so we will also use a terminal-based web browser to connect to the website.
CLI:
For the command line, we are going to use the tool "lynx". Lynx is a terminal-based web browser that does not render images or other multimedia content, which makes it much faster than graphical browsers.
# sudo apt install lynx -y
Install lynx terminal browser
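Before moving on, you can optionally confirm that lynx is installed by printing its version:
# lynx -version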
Let us list the links on the GeeksForGeeks projects page. But before that, we must understand the lynx options we are going to use.
- -dump: This dumps the formatted output of the document to standard output.
- -listonly: This lists only the links present at the given URL. It is used together with -dump.
Now apply these options:
# lynx -dump -listonly https://www.geeksforgeeks.org/computer-science-projects/
dump all links on terminal
Or redirect this terminal output to any text file:
# lynx -dump -listonly https://www.geeksforgeeks.org/computer-science-projects/ > links.txt

Now see the links using the cat command:
# cat links.txt
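The lynx listing mixes internal and external links together. If you want to separate them at this stage, a simple hedged approach is to filter the dump with grep (geeksforgeeks.org is used as the example domain here; replace it with the site you are inspecting):
# lynx -dump -listonly https://www.geeksforgeeks.org/computer-science-projects/ | grep 'geeksforgeeks.org'
# lynx -dump -listonly https://www.geeksforgeeks.org/computer-science-projects/ | grep 'http' | grep -v 'geeksforgeeks.org'
The first command prints the internal links (those on the site's own domain), and the second prints the external links (URLs on any other domain).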

Shell Script
We could easily do all of the work shown above from a script file, which is both easier and more enjoyable. There are different ways to get the links, such as regular expressions. We will use a regular expression with the "sed" command: first download the webpage as text, then apply the regular expression to the text file.
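As a quick side note, grep can also pull the href attributes out of a downloaded page. This is only a hedged alternative (not the method used in the script below) and assumes webpage.txt already holds the page's HTML:
# grep -o 'href="[^"]*"' webpage.txt | sed 's/^href="//; s/"$//'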
Now we will create a file using the nano editor. Code explanation is given below.
# nano returnLinks.sh

Below is the implementation:
#!/bin/bash
# Give the url
read urL
# wget will now download this webpage into the file named webpage.txt
# The -O option writes the downloaded content to the file mentioned.
wget -O webpage.txt "$urL"
# Now we will apply the stream editor (sed) to extract the href URLs from the file.
sed -n 's/.*href="\([^"]*\).*/\1/p' webpage.txt
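To see what the sed expression matches, you can feed it a single made-up line of HTML (the anchor tag below is only for illustration):
# echo '<a href="https://www.geeksforgeeks.org/about/">About</a>' | sed -n 's/.*href="\([^"]*\).*/\1/p'
This prints only the value of the href attribute, https://www.geeksforgeeks.org/about/. Note that because the leading .* is greedy, only the last href on each input line is printed.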
Give permission to the file:
To execute a file from the terminal, we first make it executable by changing its access mode. Here 777 grants read, write, and execute permission to the owner, the group, and everyone else. Other, more restrictive modes can be used to limit access to the file.
# chmod 777 returnLinks.sh
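If you prefer not to give every user full access to the script, making it executable for its owner alone is also enough:
# chmod u+x returnLinks.sh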
Now execute the shell script and give the URL:
# ./returnLinks.sh
shell script returns links
You can also store the output in an external file:
The script stays the same; we only add output redirection to the stream editor command so that its output is stored in a file.
#!/bin/bash
# Give the url
read urL
# wget will now download this webpage into the file named webpage.txt
wget -O webpage.txt "$urL"
# Now we will apply the stream editor (sed) to extract the href URLs from the file.
# Here we redirect the output to a text file. All the other code is the same.
sed -n 's/.*href="\([^"]*\).*/\1/p' webpage.txt > links.txt

Now open the file links.txt
We will now open the file and check whether all the links are present.
# cat links.txt
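Finally, to separate the links as the title suggests, links.txt can be split into internal and external links. This is only a minimal sketch: it assumes the site's domain is geeksforgeeks.org (swap in your own domain), treats relative paths starting with "/" as internal, and writes the results to the example files internal.txt and external.txt.
# grep -E '^(https?://[^/]*geeksforgeeks\.org|/)' links.txt > internal.txt
# grep -E '^https?://' links.txt | grep -v 'geeksforgeeks\.org' > external.txt
The first command keeps links on the same domain (and relative paths) as internal links; the second keeps absolute URLs on any other domain as external links. You can then view them with cat internal.txt and cat external.txt.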
