Week2 Slides

This document discusses information representation and markup languages. It covers how computers represent raw data as binary digits and how text encoding standards like ASCII and Unicode map characters to numeric codes. The document introduces HTML as a descriptive markup language that separates content from presentation. It describes how HTML uses tags to structure documents and how CSS is used to control styling. The Document Object Model is introduced as a tree structure that represents an HTML document and allows for programmatic manipulation.

Uploaded by

Utkarsh Singh

Markup

Information representation
Markup

● Raw data vs Semantics
● Logical structure vs Styling
● HTML5 and CSS
Information representation
● Computers work only with “bits”
○ Binary digits: 0 and 1
● Numbers
○ Place value: binary numbers, e.g. 6 = 0110
○ Two’s complement: negative numbers, e.g. -6 = 1010 (in 4 bits)
● Letters? Arbitrary Text?
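The two number examples above can be checked with a short sketch. The helper names `to_bits` and `from_bits` are hypothetical, and the 4-bit width is an assumption taken from the slide's examples.

```python
def to_bits(value: int, width: int = 4) -> str:
    """Return the bit pattern of value in `width` bits (two's complement if negative)."""
    if not -(1 << (width - 1)) <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    # Masking with 2**width - 1 wraps negative numbers into two's complement.
    return format(value & ((1 << width) - 1), f"0{width}b")


def from_bits(bits: str, signed: bool = True) -> int:
    """Interpret a bit string as a signed (two's complement) or unsigned integer."""
    value = int(bits, 2)
    if signed and bits[0] == "1":
        value -= 1 << len(bits)
    return value


print(to_bits(6))                       # 0110
print(to_bits(-6))                      # 1010
print(from_bits("1010"))                # -6
print(from_bits("1010", signed=False))  # 10
```

The same pattern 1010 reads as -6 or as 10 depending on whether it is treated as signed, which anticipates the question of interpretation raised for text.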
Representing Text

● ASCII
● Unicode
● UTF-8
Information Interchange
● Communicate through machines - either between machines or between humans
● Machines only work with bits
● Standard “encoding”
○ Some sequence of bits interpreted as a character
Interpretation
What is “0100 0001”?

● String of bits
● Number with value 65 decimal
● Character “A”
● All of the above

Matter of interpretation and Context
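As a small illustration of the point above, the same eight bits can be read three ways in Python. The variable names are only for this sketch.

```python
bits = "01000001"  # "0100 0001" from the slide, without the space

as_number = int(bits, 2)                 # read as an unsigned binary number
as_char = chr(as_number)                 # read as a character code
as_bytes = as_number.to_bytes(1, "big")  # the raw byte itself

print(as_number)  # 65
print(as_char)    # A
print(as_bytes)   # b'A'
```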


ASCII
● American Standard Code for Information Interchange
● 7-bits: 128 different entities
○ ‘a’ .. ‘z’
○ ‘A’ .. ‘Z’
○ ‘0’ .. ‘9’
○ Special characters: !@#$%^&*() …
● Why 7-bits?
● What about other characters? अ அ അ ฆ 不
○ 1000s of characters needed
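A brief sketch of the 7-bit limit: every ASCII character has a code below 128, and the characters from other scripts shown above do not. The helper `fits_in_ascii` is a hypothetical name used only here.

```python
# Print a few ASCII characters with their decimal code and 7-bit pattern.
for ch in "aAz0!":
    print(ch, ord(ch), format(ord(ch), "07b"))


def fits_in_ascii(text: str) -> bool:
    """True if every character has a code in the 7-bit range 0..127."""
    return all(ord(ch) < 128 for ch in text)


print(fits_in_ascii("Hello!"))  # True
print(fits_in_ascii("अ"))       # False
```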
Unicode
● Allow codes for more scripts, characters
● How many?
○ All living languages? All extinct languages? All future languages?
● “Universal Character Set” encoding - UCS
○ UCS-2: 2 bytes per character - max 65,536 characters
○ UCS-4: 4 bytes per character: 4 Billion+ characters
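The fixed-width idea can be sketched as follows. Python has no codecs named UCS-2 or UCS-4, so this uses `utf-32-be` (always 4 bytes per character) and `utf-16-be` (2 bytes for characters that fit in the 65,536-character range) as stand-ins; that substitution is an assumption of the sketch, not something from the slides.

```python
text = "Aअ不"

# Each character has a numeric code point, independent of how it is stored.
for ch in text:
    print(ch, f"U+{ord(ch):04X}")

four_byte = text.encode("utf-32-be")  # big-endian, no byte-order mark
two_byte = text.encode("utf-16-be")   # 2 bytes each for these three characters

print(len(four_byte))  # 12
print(len(two_byte))   # 6
```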
Efficiency?
● Most common language on Web: ???
● Should all characters be represented with same number of bits?
● Example:
○ Text document with 1000 words, approximately 5000 characters (including spaces)
○ UCS-4 encoding: 32b x 5000 = 160,000 bits
○ ASCII encoding: 8b x 5000 = 40,000 bits
○ Original 7-bit ASCII sufficient for English: 7b x 5000 = 35,000 bits
○ Minimum needed to encode just ‘a’ - ‘z’, numbers and some special characters: could fit in 6 bits: 30,000 bits
○ Optimal coding based on frequency of occurrence:
■ ‘e’ is most common letter, ‘t’, ‘a’, ‘o’, …
■ Huffman or similar encoding: roughly 10,000 to 20,000 bits, possibly less
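The arithmetic above, together with a frequency-based code, can be sketched like this. The Huffman routine is a minimal version that only computes code lengths and ignores the cost of storing the code table; the function names and the sample sentence are assumptions for illustration.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text: str) -> dict[str, int]:
    """Return the Huffman code length in bits for each distinct character."""
    freq = Counter(text)
    if len(freq) <= 1:
        return {ch: 1 for ch in freq}
    # Heap entries: (total frequency, unique tie-breaker, {char: depth so far}).
    heap = [(count, i, {ch: 0}) for i, (ch, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {ch: depth + 1 for ch, depth in {**left, **right}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


def huffman_bits(text: str) -> int:
    """Total bits needed for text under its own Huffman code."""
    lengths = huffman_code_lengths(text)
    freq = Counter(text)
    return sum(freq[ch] * lengths[ch] for ch in freq)


n_chars = 5000
for name, width in [("UCS-4", 32), ("8-bit ASCII", 8), ("7-bit ASCII", 7), ("6-bit", 6)]:
    print(f"{name}: {width * n_chars} bits")

sample = "the quick brown fox jumps over the lazy dog " * 10
print(huffman_bits(sample), "bits with Huffman vs", 8 * len(sample), "at 8 bits per character")
```

The saving depends entirely on the text being encoded, which is the limitation the next slide picks up.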
Solvable in general?
● Impossible to encode by actual character frequency: depends on text
○ Just use compression methods like “zip” instead!
● But can encoding be a good halfway point?

Example:

● Use 1 byte for the most common characters
● Group the others according to frequency, with “prefix” codes to indicate the group
Prefix Coding

1st Byte   2nd Byte   3rd Byte   4th Byte   Free Bits       Maximum Expressible Unicode Value
0xxxxxxx                                    7               007F hex (127)
110xxxxx   10xxxxxx                         (5+6)=11        07FF hex (2047)
1110xxxx   10xxxxxx   10xxxxxx              (4+6+6)=16      FFFF hex (65535)
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   (3+6+6+6)=21    10FFFF hex (1,114,111)
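The prefix table can be turned into a small encoder. This is a sketch that follows the four rows directly and compares its output with Python's built-in UTF-8 codec; `utf8_encode` is a hypothetical name, and unlike a full implementation it does not reject surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the prefix patterns from the table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point <= 0x7F:    # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:   # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


# One character per row of the table: 1, 2, 3 and 4 bytes.
for ch in ["A", "é", "अ", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", " ".join(f"{b:08b}" for b in encoded))
```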
Example

Src: https://www.w3.org/International/articles/definitions-characters/
UTF-8
● Use 8 bits for most common characters: ASCII subset
○ All ASCII documents are automatically UTF-8 compatible
● All other characters can be encoded based on prefix string
● More difficult for text processor:
○ first check prefix
○ linked list through chain of prefixes possible
○ Still more efficient for majority of documents
● Most common encoding in use today
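The "first check prefix" step can be sketched as a function that looks at the leading bits of a byte to decide how many bytes the character occupies. The names `sequence_length` and `split_characters` are hypothetical, and the sketch assumes well-formed input rather than validating continuation bytes.

```python
def sequence_length(first_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a leading byte (continuation byte or invalid)")


def split_characters(data: bytes) -> list[bytes]:
    """Split UTF-8 data into one chunk of bytes per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


data = "Aअ不".encode("utf-8")
print([len(chunk) for chunk in split_characters(data)])  # [1, 3, 3]

# An ASCII-only document is byte-for-byte identical in UTF-8.
print("plain ASCII".encode("ascii") == "plain ASCII".encode("utf-8"))  # True
```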
Markup

● Content vs Meaning
● Types of markup
● (X)HTML

[Figure: Content, Markup, Result]
Types of Markup
● Presentational
○ WYSIWYG: directly format output and display
○ Embed codes not part of regular text, specific to the editor
● Procedural
○ Details on how to display:
■ change font to large, bold
■ skip 2 lines, indent 4 columns
● Descriptive
○ This is a <title>, this is a <heading>, this is a <paragraph>

Coombs et al, “Markup Systems and the Future of Scholarly Text Processing”, Communications of the ACM, 1987
Examples
● MS Word, Google Docs etc:
○ User interface focused on “appearance”, not meaning
○ WYSIWYG: direct control over styling
○ Often leads to complex formatting and loss of inherent meaning
● LaTeX, HTML (general *ML)
○ Focus on meaning
○ More complex to write and edit, not WYSIWYG in general
Semantic Markup
● Content vs Presentation
● Semantics
○ Meaning of the text
○ structure or logic of the document
HTML (and co.)

● HyperText Markup Language
● Generalizations
● Variants of Interest
HyperText Markup Language
● HTML first used by Tim Berners-Lee in original Web at CERN (~1989)
● Considered an application of SGML (Standard Generalized Markup Language)
○ Strict definitions on structure, syntax, validity
● HTML meant for browser interpretation
○ Very forgiving: loose validity checks
○ Best effort to display
HTML Example
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

</body>
</html>
Tags
● <h1> </h1> - paired tags
● Angle brackets < >
● Closing tag with /
● Location specific: <!DOCTYPE>: only at the very start of the document
● Case-insensitive
Nesting
● <em><strong>Hello</strong></em>
● Result: Hello (shown in bold italic by default)

Invalid:

● <em><strong>Hello</em></strong>
● <em><strong>Hello</em>
● <em><strong>Hell<o/em></strong>
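The nesting rule amounts to "the most recently opened tag must be closed first", which is a stack discipline. Below is a minimal sketch using Python's standard `html.parser`; the class and function names are hypothetical, and the check deliberately ignores void elements such as `<br>`, which a real validator would have to allow.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Track open tags on a stack and record closing tags that do not match."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")


def is_properly_nested(fragment: str) -> bool:
    """True if every tag is closed, in the reverse of the order it was opened."""
    checker = NestingChecker()
    checker.feed(fragment)
    checker.close()
    return not checker.errors and not checker.open_tags


print(is_properly_nested("<em><strong>Hello</strong></em>"))  # True
print(is_properly_nested("<em><strong>Hello</em></strong>"))  # False
print(is_properly_nested("<em><strong>Hello</em>"))           # False
```

Browsers are far more forgiving than this check: they will usually still display all of these fragments on a best-effort basis.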
Presentation vs Semantics
● <strong>Hello</strong>
● <b>Hello</b>
● Hello

Which one is right? Which is better?


Timelines
● SGML based
○ 1989 - HTML original
○ 1995 - HTML 2
○ 1997 - HTML 3, 4
● XML based
○ XHTML - 1997 - mid 2010s
● HTML5
○ first release 2008
○ W3C recommendation - 2014
HTML5
● Block elements: <div>
● Inline elements: <span>
● Logical elements: <nav>, <footer>
● Media: <audio>, <video>

Remove “presentation only” tags:

● <center>
● <font>
Document Object Model
<html>
<head>
<title>My title</title>
</head>
<body>
<h1>A heading</h1>
<a href="link">Link Text</a>
</body>
</html>

Src: B. Eriksson, Wikipedia


DOM
● Tree structure representing logical layout of document
● Direct manipulation of tree possible!
● Application Programming Interfaces (APIs)
○ Canvas
○ Offline
○ Web Storage
○ Drag and Drop
○ …
● JavaScript is the primary means of manipulating the tree
● CSS used for styling
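The tree idea can be sketched outside the browser. In a browser the tree would be manipulated from JavaScript through the DOM API; the sketch below instead uses Python's `xml.etree.ElementTree` as a stand-in, which works here only because the example markup happens to be well-formed XML. The `outline` helper and the added paragraph are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The markup from the DOM slide; well-formed, so an XML parser can read it.
source = """<html>
<head>
<title>My title</title>
</head>
<body>
<h1>A heading</h1>
<a href="link">Link Text</a>
</body>
</html>"""

root = ET.fromstring(source)


def outline(node, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(root)))

# Direct manipulation: change a node's text and add a new node under <body>.
body = root.find("body")
body.find("h1").text = "A new heading"
paragraph = ET.SubElement(body, "p")
paragraph.text = "Added by code"

print([child.tag for child in body])  # ['h1', 'a', 'p']
```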
Styling

● Markup vs Style
● Themes
● CSS
Markup vs Style
<h1>Hello</h1>

The same heading under three different styles:

● Hello - Font: Garamond, Size: 24, Bold
● Hello - Font: Arial, Size: 30, Bold
● Hello - Font: Comic Sans, Size: 24, Bold, Italic, FontColor: Green, Background: Red
Separation of Styling
● Style hints in separate blocks
○ Separate files included
● Themes
● Style Sheets
○ Specify presentation information
● Cascading Style Sheets (CSS)
○ Allow multiple definitions
○ Latest takes precedence
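The "latest takes precedence" rule can be modelled as merging property tables in order. This is a deliberately simplified sketch: real CSS also weighs selector specificity, importance, and origin, none of which is modelled here. The function name `cascade` and the two rule sets are assumptions.

```python
def cascade(*rule_sets: dict[str, str]) -> dict[str, str]:
    """Merge style declarations in order; later declarations override earlier ones."""
    result: dict[str, str] = {}
    for rules in rule_sets:
        result.update(rules)
    return result


earlier = {"color": "maroon", "margin-left": "40px"}
later = {"color": "blue"}

print(cascade(earlier, later))  # {'color': 'blue', 'margin-left': '40px'}
```

Properties that the later definition does not mention, such as `margin-left` here, keep their earlier value.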
Inline CSS
● Directly add style to the tag
● Example:

<h1 style="color:blue;text-align:center;">A heading</h1>


Internal CSS
● Embed inside <head> tag
● Now all <h1> tags in document will look the same - centrally modified

<style>
body {
  background-color: linen;
}

h1 {
  color: maroon;
  margin-left: 40px;
}
</style>
External CSS
● Extract common content for reuse
● Multiple CSS files can be included
● Latest definition of style takes precedence
Responsive Design
● Mobile and Tablets have smaller screens
○ Different form factors
● Adapt to screen - Respond
● CSS controls styling - HTML controls content!
Bootstrap
● Commonly used framework
○ Originated from Twitter
○ Widely used now
● Standard styles for various components
○ Buttons
○ Forms
○ Icons
● Mobile first: highly responsive layout
JavaScript?
● Interpreted language brought into the browser
● Not related to Java beyond the name - standardized as ECMAScript
● Why?
○ HTML is not a programming language
○ CSS is not a programming language (well, …)
● Would still like to have “programmability” inside browser
● Not part of the core presentation requirements
○ Very useful, but will be considered later
● Presentation - Human interaction

Summary

● Separate content from style
○ Markup - HTML
○ Styling - CSS
