1/31/2008
Prof. Reuven Aviv Tel Hai Academic College Department of Computer Science Topics in Data Communication
World Wide Web
Acknowledgements for slides: A. Tanenbaum, Computer Networks
The World Wide Web
Architectural Overview
Static Web Documents
Dynamic Web Documents
HTTP: The HyperText Transfer Protocol
Content Delivery Networks
Architectural Overview
Client with a Web browser program
Server with a Web server and pages (HTML)
Other servers with Web servers and pages
Links between pages
Browser operation when the user clicks on a link (B = browser, W.S. = Web server):
1. B picks the URL from the clicked link
2. B gets the IP address of the Web server from DNS
3. B opens a TCP connection to (IP, port 80)
4. B sends a request for the page (HTTP packet)
5. W.S. sends the linked page (HTTP packet); the page is written in HTML
6. B closes the TCP connection
7. B interprets the HTML and displays the page to the user
8. B fetches and presents the images linked from the page
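The steps above can be sketched in Python. This is a minimal illustration, not how any real browser is implemented: the function names `build_request` and `fetch` are made up for this example, HTTP/1.0 is assumed so that the server closes the connection after the reply, and rendering and image fetching are left out.

```python
import socket
from urllib.parse import urlsplit

def build_request(url):
    """Return (host, port, request_bytes) for a plain-HTTP GET of url."""
    parts = urlsplit(url)                  # step 1: take the URL apart
    host = parts.hostname
    port = parts.port or 80                # Web servers listen on port 80 by default
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    return host, port, request.encode("ascii")

def fetch(url):
    """Resolve the host, open a TCP connection, send the request, read the reply."""
    host, port, request = build_request(url)
    ip = socket.gethostbyname(host)        # step 2: DNS lookup
    with socket.create_connection((ip, port), timeout=10) as conn:  # step 3
        conn.sendall(request)              # step 4: send the HTTP request
        chunks = []
        while True:                        # step 5: read until the server closes
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)                # step 6: connection closed by the with-block
```

Calling `fetch("http://www.ietf.org/rfc.html")` needs network access; `build_request` can be tried offline.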
The Client Side
Non-HTML content in a page: PDF, GIF, JPEG, MP3, MPEG, ...
Plug-ins: code installed as an extension to the browser
  Plug-in code uses browser functions and vice versa, e.g. the browser supplies the data to the plug-in
Helper applications: invoked by B as a separate process
[Figure: a plug-in and a helper application]
Server Side
Accepts a TCP connection
Gets the name of the requested file (HTTP packet)
Gets the file (local disk)
Sends back the file (HTTP packets)
Releases the TCP connection
To improve performance:
  Maintain a cache of files
  Multithreading
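As a rough sketch of the middle three steps (get the file name, get the file, send it back), the function below maps one request to one response. The name `handle_request` and the `doc_root` parameter are assumptions for this example; the TCP accept/release steps, caching, and most error handling are omitted.

```python
import os

def handle_request(request: bytes, doc_root: str) -> bytes:
    """Map one HTTP request to one HTTP response, serving files under doc_root."""
    # Get the name of the requested file from the request line.
    request_line = request.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = request_line.split()
    if len(parts) < 2 or parts[0] != "GET":
        return b"HTTP/1.0 400 Bad Request\r\n\r\n"
    # Resolve the name to a file on the local disk, refusing paths outside doc_root.
    root = os.path.realpath(doc_root)
    target = os.path.realpath(os.path.join(root, parts[1].lstrip("/")))
    if os.path.commonpath([root, target]) != root or not os.path.isfile(target):
        return b"HTTP/1.0 404 Not Found\r\n\r\n"
    # Get the file and send it back.
    with open(target, "rb") as f:
        body = f.read()
    header = f"HTTP/1.0 200 OK\r\nContent-Length: {len(body)}\r\n\r\n"
    return header.encode("ascii") + body
```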
Multi-threaded Web Server
Front-end thread accepts the request and builds a record
It passes the record to a Working Thread (WT)
All threads share memory, including the cache
If the page is not in the cache, the WT initiates a disk read
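A toy version of this design, assuming a dictionary `DISK` stands in for the local disk and a queue carries records from the front end to the working threads. All names here are illustrative; a real server would hand over sockets rather than strings.

```python
import queue
import threading

DISK = {"/index.html": b"<html>home</html>"}   # stand-in for the local disk
cache = {}                                      # shared by all threads
cache_lock = threading.Lock()
request_queue = queue.Queue()                   # records from the front end

def worker():
    """Working thread: take a record, check the cache, fall back to the 'disk'."""
    while True:
        record = request_queue.get()
        if record is None:                      # sentinel: shut down
            break
        path, reply = record
        with cache_lock:
            page = cache.get(path)
        if page is None:                        # cache miss: read from "disk"
            page = DISK.get(path, b"")          # empty page if the file is missing
            with cache_lock:
                cache[path] = page
        reply.put(page)

def front_end(path):
    """Build a record for the request and pass it to a working thread."""
    reply = queue.Queue(maxsize=1)
    request_queue.put((path, reply))
    return reply.get()

def serve(paths, n_workers=3):
    """Start the working threads, serve the given paths, then stop the threads."""
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    pages = [front_end(p) for p in paths]
    for _ in threads:
        request_queue.put(None)
    for t in threads:
        t.join()
    return pages
```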
Tasks of a Working Thread:
Resolve the name of the file
Authenticate the client (another lecture)
Perform access control on the client
Check the cache
Fetch the file from disk
Determine the MIME type of the file (this will be sent to the client)
Send the reply to the client: construct HTTP packet(s)
Write an entry in the Web server log
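One of these tasks, determining the MIME type, can be illustrated with Python's standard `mimetypes` module, which guesses the type from the file extension. The helper name `mime_type_for` and the fallback to `application/octet-stream` are choices made for this sketch.

```python
import mimetypes

def mime_type_for(filename: str) -> str:
    """Guess the MIME type from the file name; fall back to a generic binary type."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime or "application/octet-stream"

def content_type_header(filename: str) -> str:
    """Build the Content-Type header line that tells the client the MIME type."""
    return f"Content-Type: {mime_type_for(filename)}"
```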
What if the CPU can't handle the load?
Server Farm on a LAN
Problems:
Each processing node (P.N.) has its own cache (no shared memory)
P.N.s specialize in certain files (the front end sends requests for a page to the same node)
Both requests and replies pass through the front end
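The slide does not say how the front end decides which node gets which file. One possible scheme, shown here purely as an assumption, is to hash the requested path so that the same page always goes to the same node and is cached only there.

```python
import hashlib

def pick_node(path: str, n_nodes: int) -> int:
    """Map a requested path to a processing node index in a repeatable way."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes
```

Because the choice depends only on the path, repeated requests for one page hit the same node's cache; the price is that the load may be spread unevenly if a few pages are very popular.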
Solution?
TCP Handoff
The front end passes the TCP endpoint (IP, port) to the processing node
The processing node sends the page directly to the client
[Figure: message flow in the normal case and with TCP handoff]
URLs: Uniform Resource Locators
URL provides answers to what?
What is the name of the page? What is the location of the page? How to access the page (which protocol)?
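These three answers can be read straight off a URL. A small sketch using the standard `urllib.parse` module, with the URL taken from the HTTP example later in these slides:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.ietf.org/rfc.html")

protocol = parts.scheme      # how to access the page
location = parts.hostname    # where the page is located (DNS name of the server)
page = parts.path            # what the page is called on that server
```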
Statelessness and Cookies
HTTP is request/reply and stateless
But the server needs:
  to recognize users (registered? adapt the home page)
  to keep track of visited items (shopping cart)
Cookies (small text files) keep that information
  Stored at the client, e.g. C:\Documents and Settings\aviv\Cookies
  Identified by the domain name of the sending server
Cookies: Structure
Domain: where the cookie came from
Path: root of the part of the server's file tree that the cookie applies to
Content: variableName=value pairs; can be anything
Expires: if set, the cookie is kept until that time (persistent cookie); otherwise it is discarded when the browser exits
Secure: if set, the cookie is sent only to a secure server
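To make the fields concrete, here is a sketch using Python's standard `http.cookies` module. The helper names and the example values (`cart`, `store.example`) are invented for illustration.

```python
from http.cookies import SimpleCookie

def make_set_cookie(name, value, domain, path="/", secure=False):
    """Server side: build the Set-Cookie header line for one cookie."""
    cookie = SimpleCookie()
    cookie[name] = value                 # Content: a name=value pair
    cookie[name]["domain"] = domain      # Domain: where the cookie came from
    cookie[name]["path"] = path          # Path: part of the file tree it applies to
    if secure:
        cookie[name]["secure"] = True    # Secure: send only to a secure server
    return cookie.output()

def read_cookie_header(header_value):
    """Server side: parse the Cookie header value sent back by the browser."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {name: morsel.value for name, morsel in cookie.items()}
```

The browser stores what it receives in `Set-Cookie` and returns the name=value pairs in a `Cookie` request header on later visits to the same domain and path.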
Usages?
Using cookies
Casino server chooses which gambling options to present
Store server keeps the items in the cart in the cookie
Web portal server presents stock prices and sports results
sneaky.com records visits of a UserID to certain pages
  The pages include ads/banners/small pictures served by sneaky.com
  The user is not aware that the browser visited sneaky.com
  A user profile is built, maybe with name/password
HTML: Hypertext Markup Language
Text with markup instructions (formatting, links, ...)
Instructions come as pairs of tags, e.g. <h2> ... </h2>
[Figure: the formatted page as presented by the browser]
Some HTML Tags
HTML Table
HTML Input: Forms
Browser presents a Web page with a form
User fills in the form
Browser stores the user's inputs in variables
Browser sends the information via HTTP
HTML Input: Web page with a Form
Browser Response
A possible response from the browser to the server, carrying the information filled in by the user: a string of name=value pairs
Server passes the string to a back-end script for processing (e.g. a Perl script)
The script writes to a DB and might create a new page
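The name=value string can be produced and decoded with the standard `urllib.parse` module. The field names and values below are made up for this sketch; spaces become `+` and pairs are joined with `&`.

```python
from urllib.parse import urlencode, parse_qs

# Browser side: the user's inputs, keyed by the form's field names.
form = {"customer": "John Doe", "city": "White Plains"}
encoded = urlencode(form)

# Server side: the back-end script turns the string back into fields.
decoded = {name: values[0] for name, values in parse_qs(encoded).items()}
```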
Automatic Processing of Web Pages
Programs need to process HTML Web pages
  E.g. find a book that was published after 2000
A program searches page(s) that have no structure
  It is hard for a program to tell whether 2000 is a year or a price
Idea: build documents (pages) with a structure that is useful to programs
Describe a document in XML, a language for defining named structures and sub-structures
XML: eXtensible Markup Language
A simple Web page in XML
Hierarchical structure
We define a structure named book_list
book_list: a list of three structures named book
book: three fields, each with a name and a value
A simple Web page in XML
A program can search for book_list.book.date >= 2002
How will a browser present this page to a user?
  It needs a processor that creates an HTML page with formatting tags from the XML document
  Instructions for the processor are in another file
    Written in the eXtensible Style Language (XSL)
    Referenced in the XML file (at the top)
Browsers include an XML/XSL processor and do this automatically on given XML/XSL files
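A sketch of such a search with Python's standard `xml.etree.ElementTree`. The document below is a stand-in, not the one from the slide: the field names `title`, `author`, and `date` and all the values are assumptions chosen to match the book_list/book structure described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical book_list with three book structures of three fields each.
DOC = """<book_list>
  <book><title>Book A</title><author>Author One</author><date>2003</date></book>
  <book><title>Book B</title><author>Author Two</author><date>2001</date></book>
  <book><title>Book C</title><author>Author Three</author><date>1999</date></book>
</book_list>"""

def titles_since(xml_text: str, year: int):
    """Return the titles of all books whose date field is >= year."""
    root = ET.fromstring(xml_text)               # root element is book_list
    return [
        book.findtext("title")
        for book in root.findall("book")
        if int(book.findtext("date")) >= year
    ]
```

Because the year sits in a named field, the program no longer has to guess whether a number is a year or a price.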
13
1/31/2008
eXtensible Style Language
[Figure: an XSL listing, with annotations marking the pure HTML parts and the XSL language program]
Server Side Dynamic Pages: CGI Script
Dynamic Web Documents
Server-side dynamic pages: embedded PHP
Web server calls the PHP interpreter before sending test.php to the browser
Web server keeps information about the browser (OS type, ...) in the variable HTTP_USER_AGENT
PHP rewrites the page, inserting the value of HTTP_USER_AGENT
Web page with a form; PHP script processing the form data
User input: Barbara, 24
Output from the PHP script: an HTML page
Client-Side Dynamic Pages: Embedded Javascript
Server Side & Client Side Dynamic Pages
Client Side is faster. Used for local interaction with User
JavaScript is a full-blown language
Various ways to create and Display Content
Embedded Java applets
Downloadable ActiveX controls
HTTP Protocol
HTTP Protocol (1)
Versions 1.0 and 1.1 (RFC 2616)
Request/response, using TCP (port 80 on the server side)
Persistent connections (HTTP 1.1)
Request: ASCII
Response: RFC 822 MIME-like
A general protocol for object-oriented applications
  Accessing functionality of remote objects
  Many, but not all, methods are Web-specific
  E.g. GET object (not necessarily a file)
HTTP Protocol (2)
Transaction-oriented client/server protocol
  Between a Web browser (client) and a Web server
Stateless
  Each transaction is treated independently
Flexible format handling
  Client may specify supported formats
Examples of HTTP Operation
Direct connection
Via Intermediary system(s)
Caching
Intermediary systems 1: Proxy process
Usage: clients within an organization must authenticate an external Web server
The proxy sits on the client side of the firewall (FW)
  a. The proxy authenticates the server (e.g. password, certificate)
  b. Replies carry authentication data, e.g. an SSL header (encrypted hash of the message)
The proxy sends requests to the server and replies to the clients
  Acts as a client in interacting with the server
  Acts as a server in interacting with the clients
Types of Intermediate HTTP Systems
Intermediary systems 2: Gateway process
1: A server inside an organization must authenticate an external client
  The gateway (GW) sits on the server side of the firewall
  a. The GW authenticates the client (e.g. password, certificate)
  b. Requests carry authentication data, e.g. an SSL header (encrypted hash of the message)
2: A client connects to a non-HTTP server (e.g. FTP)
  The client sends HTTP requests; the GW translates them
Intermediary systems 3: Tunnel
A tunnel performs no operation on HTTP messages
Used when an intermediary is required for the connection but understanding HTTP is not
E.g. initial authentication of the client and/or server
  After that, messages are relayed unchanged
HTTP Operation - Caches
Caching can be done by a client, a server, or an intermediary system
A cache stores previous requests/responses
It may return a stored response to subsequent requests
Not all requests can be cached
HTTP Messages
General Structure of HTTP message
Request line: Method (e.g. GET), Resource (file name), HTTP version
Response status line: HTTP version; Status code (e.g. 200); Reason (e.g. OK)
Headers
  General: Date, Upgrade (to a better version)
  Request: Host, Accept-Charset, ...
  Response: Server (software), Accept-Ranges (server will accept requests for part of a page, with the range expressed in bytes)
  Entity: Content-Type, Last-Modified, ...
Entity body: data (e.g. an HTML page)
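The general structure (start line, headers, blank line, entity body) can be taken apart in a few lines. This is a simplified sketch: the function names are invented here, repeated headers and folded header lines are not handled, and header names are lowercased for lookup.

```python
def parse_message(raw: bytes):
    """Split an HTTP message into (start_line, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")    # a blank line ends the headers
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]                         # request line or status line
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body

def split_start_line(start_line: str):
    """Request line -> [method, resource, version]; status line -> [version, code, reason]."""
    return start_line.split(" ", 2)
```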
Request and Reply
GET /rfc.html HTTP/1.1                          // Request line
Host: www.ietf.org                              // Request header

HTTP/1.1 200 OK                                 // Status line
Date: Wed, 08 May 2002 22:54:22 GMT             // General header
Server: Apache/1.3.20 (Unix) mod_ssl/2.8.4      // Response header
Last-Modified: Mon, 11 Sep 2000 13:56:29 GMT    // Entity headers
ETag: 2a79d-c8b-39bce48d                        // Page id, used in caching
Accept-Ranges: bytes                            // Ranges are expressed in bytes
Content-Length: 3211
Content-Type: text/html
X-Pad: avoid browser bug                        // Non-standard field

<html> ..
Conditional GET (1)
GET /fruit/kiwi.gif HTTP/1.0
User-agent: Mozilla/4.0

HTTP/1.0 200 OK
Date: Wed, 12 Aug 1998 15:39:29
Server: Apache/1.3.0 (Unix)
Last-Modified: Mon, 22 June 1998 09:23:24
Content-Type: image/gif

(data)
Conditional GET (2)
One week later:
GET /fruit/kiwi.gif HTTP/1.0
User-agent: Mozilla/4.0
If-Modified-Since: Mon, 22 June 1998 09:23:24

HTTP/1.0 304 Not Modified
Date: Wed, 19 Aug 1998 15:39:29
Server: Apache/1.3.0 (Unix)

(empty entity body)
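The server's decision in this exchange can be sketched as a date comparison. The function name and return shape are assumptions for this example, and the dates are written in the usual HTTP form with a three-letter month and "GMT", which `email.utils.parsedate_to_datetime` from the standard library can parse.

```python
from email.utils import parsedate_to_datetime

def conditional_get(request_headers: dict, last_modified: str):
    """Return (status_code, reason, send_body) for a GET of one resource."""
    since = request_headers.get("If-Modified-Since")
    if since is not None:
        # Unchanged since the client's copy: answer 304 with an empty body.
        if parsedate_to_datetime(last_modified) <= parsedate_to_datetime(since):
            return 304, "Not Modified", False
    return 200, "OK", True

LAST_MODIFIED = "Mon, 22 Jun 1998 09:23:24 GMT"
first = conditional_get({}, LAST_MODIFIED)
week_later = conditional_get({"If-Modified-Since": LAST_MODIFIED}, LAST_MODIFIED)
```

Here `first` corresponds to the plain GET that returns the image, and `week_later` to the conditional GET that lets the browser reuse its cached copy.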
HTTP 1.1 Methods
Response Status Codes
Request Headers
User-Agent: information about the browser (OS)
Accept: types of pages the client can handle
Host: the server's DNS name
Authorization: client credentials (e.g. password)
Cookie #: a cookie that was received before
Date
Upgrade: suggests a switch to another version
Accept-Charset; Accept-Encoding; Accept-Language
Response Headers
Server: information about the server
Content-Encoding; Content-Length; Content-Language; Content-Type (MIME type)
Last-Modified
Location: tells the client to go elsewhere
Accept-Ranges: the server will accept byte-range requests
Set-Cookie #: please save the attached cookie with number #
Date
Upgrade
Entity Body
The entity body is an arbitrary sequence of octets
HTTP can transfer any type of data, including text, binary data, audio, images, video
The data is the content of the resource identified by the URL
Interpretation of the data is determined by header fields:
  Content-Type: defines the data interpretation
  Content-Encoding: encoding applied to the data
  Transfer-Encoding: encoding used to form the entity body
More Header Fields
Forwarded: gateways and proxies add this header with their URL
Connection: close, keep-alive, ...; special instructions
Keep-Alive: if keep-alive was set in Connection, it indicates the maximum time the sender will keep the connection open waiting for the next request, or the maximum number of additional requests that will be allowed on the current persistent connection
Pragma: implementation-specific information relevant to any recipient along the way
HTTP Messages BNF Format
HTTP-Message    = Simple-Request | Simple-Response | Full-Request | Full-Response
Full-Request    = Request-Line *( General-Header | Request-Header | Entity-Header ) CRLF [ Entity-Body ]
Full-Response   = Status-Line *( General-Header | Response-Header | Entity-Header ) CRLF [ Entity-Body ]
Simple-Request  = "GET" SP Request-URL CRLF
Simple-Response = [ Entity-Body ]
Content Delivery Networks
Content Delivery Networks (1)
A Content Provider (CP) has a main page with links to many content items (pictures, music, video, newspapers)
A CDN company (e.g. Akamai) contracts with the Content Provider to deliver the content from its CDN content servers
The CDN also contracts with many, O(10K), ISPs to place CDN content servers holding the content on the ISPs' networks
The CDN redirects the links in the main page of the CP to the CDN main server (by changing the href)
Example: The Furry Video Content Provider
Original Web Page Of Content Provider
Web Page Of Content Provider With redirections
Example (contd)
User types www.furryvideo.com and gets the main page of the content provider FurryVideo
User clicks on a content item
Client sends a request to cdn-server.com
cdn-server identifies which object is required (from the file name) and which CDN content server is closest to the client (from the user's IP address)
cdn-server sends a response to the client with status code 301 and a Location header giving the file's URL on a content server close to the client
Client connects to that CDN content server
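The redirect step can be sketched as a lookup followed by a 301 reply. Everything concrete below is hypothetical: the network ranges, the server names, and the rule "closest = the server whose network contains the client's address" are assumptions for illustration, not how any particular CDN chooses a server.

```python
import ipaddress

# Hypothetical map from client networks to nearby CDN content servers.
CONTENT_SERVERS = [
    (ipaddress.ip_network("192.0.2.0/24"), "cdn-east.example"),
    (ipaddress.ip_network("198.51.100.0/24"), "cdn-west.example"),
]
DEFAULT_SERVER = "cdn-main.example"

def redirect(client_ip: str, path: str) -> str:
    """Build the 301 reply that points the client at a nearby content server."""
    addr = ipaddress.ip_address(client_ip)
    server = next(
        (name for net, name in CONTENT_SERVERS if addr in net),
        DEFAULT_SERVER,                     # no nearby server known for this client
    )
    return (
        "HTTP/1.1 301 Moved Permanently\r\n"
        f"Location: http://{server}{path}\r\n"
        "\r\n"
    )
```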