
Module 4 - Reading5 - UniformResourceLocator

A uniform resource locator (URL) specifies the location and retrieval method of a resource on a computer network. A URL identifies a web page, file, email address, or other application. It consists of components like the protocol, domain name, path, port number, query string, and fragment identifier. URLs were standardized in 1994 and allow resources to be located and accessed across the internet.

Reading: Uniform Resource Locator

Introduction

A uniform resource locator (URL) is a reference to a
resource that specifies the location of the resource on a
computer network and a mechanism for retrieving it. A URL
is a specific type of uniform resource identifier
(URI), although many people use the two terms
interchangeably. A URL implies the means to access an
indicated resource, which is not true of every URI. URLs
occur most commonly to reference web pages (http), but are
also used for file transfer (ftp), email (mailto), database
access (JDBC), and many other applications.
Most web browsers display the URL of a web page above
the page in an address bar. A typical URL has the
form http://www.example.com/index.html, which indicates
the protocol type (http), the domain name
(www.example.com), and the specific web page (index.html).
History
The Uniform Resource Locator was standardized in 1994 by
Tim Berners-Lee and the URI working group of the Internet
Engineering Task Force (IETF) as an outcome of
collaboration started at the IETF Living Documents “Birds of
a Feather” session in 1992. The format combines the pre-
existing system of domain names (created in 1985) with file
path syntax, where slashes are used to separate directory
and file names. Conventions already existed where server
names could be prepended to complete file paths, preceded
by a double-slash (//).
Berners-Lee later regretted the use of dots to separate the
parts of the domain name within URIs, wishing he had used
slashes throughout. For example,
http://www.example.com/path/to/name would have
been written http:com/example/www/path/to/name. Berners-Lee
has also said that, given the colon following the URI
scheme, the two slashes before the domain name were also
unnecessary.
Syntax
Every HTTP URL consists of the following, in the given
order. Several schemes other than HTTP also share this
general format, with some variation.
• the scheme name (commonly called protocol, although not
every URL scheme is a protocol, e.g. mailto is not a
protocol)
• a colon, two slashes,
• a host, normally given as a domain name but
sometimes as a literal IP address
• optionally a colon followed by a port number
• the full path of the resource
The scheme says how to connect, the host specifies where
to connect, and the remainder specifies what to ask for.
For programs such as Common Gateway Interface (CGI)
scripts, this is followed by a query string, and an optional
fragment identifier.
The syntax is:
scheme://[user:password@]domain:port/path?query_string#fragment_id
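As a rough illustration of the syntax above, the following sketch splits a URL into its parts with Python's standard urllib.parse module. The URL itself is a made-up example, not one taken from the reading.

```python
from urllib.parse import urlsplit

# Hypothetical URL chosen to exercise most parts of the general syntax.
url = "http://www.example.com:8080/path/to/name?first_name=John#section2"

parts = urlsplit(url)

# Each attribute corresponds to one slot in
# scheme://domain:port/path?query_string#fragment_id
print("scheme:  ", parts.scheme)
print("host:    ", parts.hostname)
print("port:    ", parts.port)
print("path:    ", parts.path)
print("query:   ", parts.query)
print("fragment:", parts.fragment)
```

Note that the "?" and "#" delimiters are not part of the query and fragment values that the parser returns.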
Component details:
• The scheme, which in many cases is the name of a
protocol (but not always), defines how the resource will
be obtained. Examples include http, https, ftp, file and
many others. Although schemes are case-insensitive,
the canonical form is lowercase.
• The domain name or literal numeric IP address gives the
destination location for the URL. A literal numeric IPv6
address may be given, but must be enclosed in [ ] e.g.
[db8:0cec::99:123a]. The domain google.com, or its
numeric IP address 173.194.34.5, is the address of
Google’s website.
• The domain name portion of a URL is not case sensitive
since DNS ignores case: http://en.example.org/ and
HTTP://EN.EXAMPLE.ORG/ both open the same page.
• The port number, given in decimal, is optional; if omitted,
the default for the scheme is used. For example,
http://vnc.example.com:5800 connects to port 5800 of
vnc.example.com, which may be appropriate for a VNC
remote control session. If the port number is omitted for
an http: URL, the browser will connect on port 80, the
default HTTP port. The default port for an https: request
is 443.
• The path is used to specify and perhaps find the resource
requested. This path may or may not describe folders
on the file system in the web server. It may be very
different from the arrangement of folders on the web
server. It is case-sensitive, though it may be treated as
case-insensitive by some servers, especially those
based on Microsoft Windows. If the server is
case-sensitive and http://en.example.org/wiki/URL is correct,
then http://en.example.org/WIKI/URL or
http://en.example.org/wiki/url will display an HTTP 404
error page, unless these URLs point to valid resources
themselves.
• The query string contains data to be passed to software
running on the server. It may contain name/value pairs
separated by ampersands, for example
?first_name=John&last_name=Doe.
• The fragment identifier, if present, specifies a part or a
position within the overall resource or document. When
used with HTML, it usually specifies a section or
location within the page. In combination with anchor
elements or the “id” attribute of an element, it causes the
browser to scroll to that part of the page.
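Several of the component rules above (case-insensitive scheme and host, default ports, name/value pairs in the query string) can be sketched in a few lines. The helper name describe and the DEFAULT_PORTS table are illustrative assumptions; the table only covers the two defaults named in the text.

```python
from urllib.parse import urlsplit, parse_qs

# Default ports named in the text; other schemes are left out of this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}

def describe(url):
    """Return (scheme, host, port, query pairs) with defaults applied."""
    parts = urlsplit(url)
    scheme = parts.scheme      # urlsplit already lowercases the scheme
    host = parts.hostname      # the hostname attribute is lowercased too
    # Use the explicit port if one was given, else the scheme's default.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    # parse_qs splits "a=1&b=2" into a dict of name -> list of values.
    return scheme, host, port, parse_qs(parts.query)

print(describe("HTTP://EN.EXAMPLE.ORG/"))
print(describe("http://vnc.example.com:5800"))
print(describe("https://example.com/form?first_name=John&last_name=Doe"))
```

The path is deliberately left untouched here, since the text notes that paths are case-sensitive on many servers.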
The scheme name defines the namespace, purpose, and the
syntax of the remaining part of the URL. Software will try to
process a URL according to its scheme and context. For
example, a web browser will usually dereference the URL
http://example.org:80 by performing an HTTP request to the
host at example.org, using port number 80.
Other examples of scheme names include https, gopher,
wais, ftp. URLs with https as a scheme (such as
https://example.com/) require that requests and responses
be made over a secure connection to the website. Some
schemes that require authentication allow a username, and
perhaps a password too, to be embedded in the URL, for
example ftp://[email protected]. Passwords
embedded in this way are not conducive to security, but the
full possible syntax is
scheme://username:password@domain:port/path?query_string#fragment_id
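To make the embedded-credentials form concrete, here is a minimal sketch that reads the username and password out of a URL and then rebuilds the URL without them, for example before writing it to a log. The URL, the credentials, and the helper name strip_credentials are all invented for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_credentials(url):
    """Rebuild a URL without its user:password@ part."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Note: bracketed IPv6 literals are not handled in this sketch.
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

# Hypothetical URL with made-up credentials.
url = "ftp://alice:s3cret@ftp.example.org:2121/pub/file.txt"
parts = urlsplit(url)

print(parts.username)          # user name before the colon
print(parts.password)          # password between the colon and "@"
print(strip_credentials(url))  # same URL with the credentials removed
```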
Other schemes do not follow the HTTP pattern. For
example, the mailto scheme only uses valid email
addresses. When clicked on in an application, the URL
mailto:[email protected] starts an e-mail composer with
the address [email protected] in the To field. The tel
scheme is even more different; it uses the public switched
telephone network for addressing, instead of domain names
representing Internet hosts.
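The difference between these schemes and the HTTP pattern shows up when such URLs are parsed. In the sketch below, the e-mail address and the telephone number are placeholders invented for the example.

```python
from urllib.parse import urlsplit

# Placeholder address and number, chosen only for illustration.
mail = urlsplit("mailto:someone@example.com")
phone = urlsplit("tel:+1-555-0100")
web = urlsplit("http://example.org/index.html")

# Without a "//" there is no host part: everything after the
# colon ends up in the path component.
for parts in (mail, phone, web):
    print(parts.scheme, "| host:", repr(parts.netloc), "| path:", repr(parts.path))
```

Only the http URL yields a non-empty host; for mailto and tel the parser reports the address or number as the path.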
List of allowed URL characters
Unreserved
Upper- and lower-case letters, digits, and the symbols
- _ . ~ never need to be encoded, although they may
optionally be:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789-_.~
Reserved
Special symbols must sometimes be percent-encoded:
! * ' ( ) ; : @ & = + $ , / ? % # [ ]
Further details can be found, for example, in RFC 3986 and
http://www.w3.org/Addressing/URL/uri-spec.html.
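As a small sketch of how the two character classes behave in practice, Python's urllib.parse.quote leaves unreserved characters alone and percent-encodes everything else. The sample string is invented; passing safe="" is a choice made here so that "/" is encoded as well.

```python
from urllib.parse import quote, unquote

# Unreserved characters pass through unchanged.
unreserved = "AZaz09-_.~"
print(quote(unreserved, safe=""))

# Hypothetical value containing reserved characters and a space.
value = "Q&A: 100% done?"
encoded = quote(value, safe="")
print(encoded)           # each special character becomes %XX
print(unquote(encoded))  # decoding restores the original text
```

This kind of encoding is what lets a value containing "&" or "?" sit inside a query string without being mistaken for a delimiter.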
Relationship to URI
A URL is a URI that, in addition to identifying a web
resource, provides a means of locating the resource by
describing its “primary access mechanism (e.g., its network
location)”.
Internet hostnames
A hostname is a domain name assigned to a host computer.
This is usually a combination of the host’s local name with its
parent domain’s name. For example, en.example.org
consists of a local hostname (en) and the domain name
example.org. The hostname is translated into an IP address
via the local hosts file, or the domain name system (DNS)
resolver. It is possible for a single host computer to have
several hostnames; but generally the operating system of
the host prefers to have one hostname that the host uses for
itself.
Any domain name can also be a hostname, as long as the
restrictions mentioned below are followed. For example, both
“en.example.org” and “example.org” can be hostnames if
they both have IP addresses assigned to them. The domain
name “xyz.example.org” may not be a hostname if it does
not have an IP address, but “aa.xyz.example.org” may still
be a hostname. All hostnames are domain names, but not all
domain names are hostnames.
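The hosts-file lookup mentioned above can be sketched as a simple table built from text. This assumes the common hosts-file layout of an address followed by one or more names per line; the addresses and entries are made up for the example, and a real resolver would fall back to DNS when a name is missing.

```python
# Made-up hosts-file content: one address, then one or more hostnames.
HOSTS_TEXT = """
# address      hostnames
192.0.2.10     en.example.org en
192.0.2.20     example.org www.example.org
"""

def parse_hosts(text):
    """Map each hostname to the address on its line."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Hostnames are case-insensitive; the first entry wins.
            table.setdefault(name.lower(), address)
    return table

table = parse_hosts(HOSTS_TEXT)
print(table.get("en.example.org"))
print(table.get("www.example.org"))
print(table.get("xyz.example.org"))  # no entry, so not resolvable here
```

The second line of the sample file also shows a single host carrying several hostnames.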
URL protocols
The protocol, or scheme, of a URL defines how the resource
will be obtained. Two common protocols on the web are
HTTP and HTTPS. For various reasons, many sites have
been switching to permitting access through both the HTTP
and HTTPS protocols. Each protocol has advantages and
disadvantages; for some users, one or the other protocol
either does not function or is very undesirable. When a
link contains a protocol specifier, the browser follows the
link using the specified protocol regardless of the
potential desires of the user.
Protocol-relative URLs
It is possible to construct valid URLs without specifying a
protocol; these are called protocol-relative links (PRL) or
protocol-relative URLs. Using PRLs on a page permits the
viewer of the page to visit new pages using whichever
protocol was used to obtain the page containing the link.
This supports continuing to use whichever protocol the
viewer has chosen to use for obtaining the current page
when accessing new pages.
An example of a PRL is //en.wikipedia.org/wiki/Main_Page
which is created by removing the protocol prefix.
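The way a PRL inherits the protocol of the page that contains it can be sketched with urllib.parse.urljoin, which resolves a link against a base URL. The two base page addresses are hypothetical.

```python
from urllib.parse import urljoin

prl = "//en.wikipedia.org/wiki/Main_Page"

# Hypothetical pages containing the link, one fetched over each protocol.
over_https = urljoin("https://example.org/start.html", prl)
over_http = urljoin("http://example.org/start.html", prl)

print(over_https)  # link followed with the https scheme of its page
print(over_http)   # same link followed with the http scheme of its page
```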
Internationalized URL
Internet users are distributed throughout the world using a
wide variety of languages and alphabets. Users expect to be
able to create URLs in their own local alphabets.
An internationalized resource identifier (IRI) is a form of URL
that includes Unicode characters. All modern browsers
support IRIs. The parts of the URL requiring special
treatment for different alphabets are the domain name and
path.
The domain name in the IRI is known as an internationalized
domain name (IDN). Web and Internet software
automatically convert the domain name into punycode
usable by the Domain Name System.
For example, the Chinese web site http://見.香港 becomes
the following for DNS lookup. The xn-- prefix indicates that
a label was not originally ASCII.
http://xn--nw2a.xn--j6w193g/
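A conversion of this kind can be tried with Python's built-in "idna" codec, as in the sketch below. One caveat: the built-in codec follows the older IDNA 2003 rules, so its results can differ from what current browsers do for some names.

```python
# Domain name from the example in the text.
hostname = "見.香港"

# Encode each label to its ASCII-compatible form for DNS lookup.
ascii_form = hostname.encode("idna").decode("ascii")
print(ascii_form)

# Decoding restores the original Unicode labels.
round_trip = ascii_form.encode("ascii").decode("idna")
print(round_trip)
```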
The URL path name can also be specified by the user in the
local alphabet. If not already encoded, it is converted to
UTF-8, and any characters not part of the basic URL
character set are escaped as hexadecimal digits using
percent-encoding.
For example, the following Japanese Web page
http://domainname/引き割り.html becomes
http://domainname/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html.
The target computer decodes the address and displays the page.
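The path conversion in this example can be reproduced with urllib.parse.quote, which encodes text as UTF-8 by default and then percent-encodes each byte outside the basic character set. This is only a sketch of the encoding step, not of what any particular browser does internally.

```python
from urllib.parse import quote, unquote

# Path from the example in the text.
path = "/引き割り.html"

# quote() keeps "/" by default and escapes the UTF-8 bytes of
# every character outside the basic URL character set.
encoded = quote(path)
print(encoded)

# The receiving side reverses the process.
print(unquote(encoded))
```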
