XML, DTD, and XML Schema
Introduction to Databases
CompSci 316 Fall 2014
Announcements (Tue. Oct. 21)
• Midterm scores and sample solution posted
• You may pick up graded exams outside my office
• Mean: 83.9
• Stdev: 11.0
• Max: 100+5
• PHP and Django example website code posted;
more to come
• Homework #3 to be assigned on Thursday
• Project milestone #1 feedback to be returned this
weekend
Structured vs. unstructured data
• Relational databases are highly structured
• All data resides in tables
• You must define schema before entering any data
• Every row conforms to the table schema
• Changing the schema is hard and may break many things
• Texts are highly unstructured
• Data is free-form
• There is no pre-defined schema, and it’s hard to
define any schema
• Readers need to infer structures and meanings
What’s in between these two extremes?
Semi-structured data
• Observation: most data have some structure, e.g.:
• Book: chapters, sections, titles, paragraphs, references,
index, etc.
• Item for sale: name, picture, price (range), ratings,
promotions, etc.
• Web page: HTML
• Ideas:
• Ensure data is “well-formatted”
• If needed, ensure data is also “well-structured”
• But make it easy to define and extend this structure
• Make data “self-describing”
HTML: language of the Web
<h1>Bibliography</h1>
<p><i>Foundations of Databases</i>,
Abiteboul, Hull, and Vianu
<br>Addison Wesley, 1995
<p>…
• It’s mostly a “formatting” language
• It mixes presentation and content
• Hard to change presentation (say, for different displays)
• Hard to extract content
XML: eXtensible Markup Language
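<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
  <book>…</book>
</bibliography>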
• Text-based
• Capture data (content), not presentation
• Data self-describes its structure
• Names and nesting of tags have meanings!
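A minimal sketch of what "self-describing" buys a program, using Python's standard `xml.etree.ElementTree`. The document string is a small hypothetical bibliography in the spirit of this slide, not the original file; the point is only that values are located by tag name and nesting rather than by position.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag names carry the meaning of each value.
DOC = """
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>
"""

root = ET.fromstring(DOC)
book = root.find("book")                              # first <book> child of the root
title = book.findtext("title")                        # text of its <title> child
authors = [a.text for a in book.findall("author")]    # all <author> children, in order
year = int(book.findtext("year"))

print(title, authors, year)
```

Nothing in the code depends on how the data is displayed, which is the contrast with the HTML version.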
Other nice features of XML
• Portability: Just like HTML, you can ship XML data
across platforms
• Relational data requires heavy-weight APIs
• Flexibility: You can represent any information
(structured, semi-structured, documents, …)
• Relational data is best suited for structured data
• Extensibility: Since data describes itself, you can
change the schema easily
• Relational schema is rigid and difficult to change
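XML terminology
<bibliography>
  <book ISBN="ISBN-10" price="80.00">
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
  <book>…</book>
</bibliography>
• Tag names: book, title, …
• Start tags: <book>, <title>, …
• End tags: </book>, </title>, …
• An element is enclosed by a pair of start and end tags: <book>…</book>
• Elements can be nested: <book>…<title>…</title>…</book>
• Empty elements: <is_textbook></is_textbook>
• Can be abbreviated: <is_textbook/>
• Elements can also have attributes: <book ISBN="ISBN-10" price="80.00">
Ordering generally matters, except for attributes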
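The terminology can be checked with a short sketch, again with `xml.etree.ElementTree`. The fragments and attribute values are hypothetical; the code only illustrates three points from the list: the two spellings of an empty element are equivalent, attribute order is irrelevant, and child element order is preserved.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: an empty element written in long and short form.
long_form = ET.fromstring("<book><is_textbook></is_textbook></book>")
short_form = ET.fromstring("<book><is_textbook/></book>")

def is_empty(elem):
    """True if the element has no child elements and no non-blank text."""
    return len(elem) == 0 and not (elem.text or "").strip()

both_empty = is_empty(long_form[0]) and is_empty(short_form[0])

# Attribute order does not matter: both parse to the same name/value mapping.
a = ET.fromstring('<book ISBN="ISBN-10" price="80.00"/>')
b = ET.fromstring('<book price="80.00" ISBN="ISBN-10"/>')
same_attributes = a.attrib == b.attrib

# Child element order does matter: the parser keeps document order.
book = ET.fromstring("<book><author>Hull</author><author>Vianu</author></book>")
author_order = [child.text for child in book]
```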
Well-formed XML documents
A well-formed XML document
• Follows XML lexical conventions
• Wrong: <section>We show that x < 0…</section>
• Right: <section>We show that x &lt; 0…</section>
• Other special entities: > becomes &gt; and & becomes &amp;
• Contains a single root element
• Has properly matched tags and properly nested
elements
• Right:
<section>…<subsection>…</subsection>…</section>
• Wrong:
<section>…<subsection>…</section>…</subsection>
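The three rules above can be exercised with any XML parser, since a conforming parser must reject documents that are not well-formed. The sketch below uses Python's `xml.etree.ElementTree` and `xml.sax.saxutils.escape`; the sample strings and the helper name `is_well_formed` are illustrative choices.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(text):
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

raw = "We show that x < 0"
checks = {
    # Lexical conventions: a literal "<" in text must be written as &lt;
    "unescaped <": is_well_formed("<section>" + raw + "</section>"),
    "escaped <": is_well_formed("<section>" + escape(raw) + "</section>"),
    # Properly matched tags and properly nested elements
    "proper nesting": is_well_formed("<a><b></b></a>"),
    "crossed tags": is_well_formed("<a><b></a></b>"),
    # A single root element
    "two roots": is_well_formed("<a/><b/>"),
}
print(checks)
```

`escape` rewrites `&`, `<`, and `>` only; attribute values that contain quotes need additional care.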
A tree representation
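[Figure: the bibliography document drawn as a tree. The root bibliography has book children; a book has title, author, publisher, year, and section children; a section has title, content, and nested section children; text such as “Foundations of Databases”, “Abiteboul”, and “Addison Wesley” sits at the leaves.]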
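The tree view is what most XML APIs expose directly. As a rough sketch, the recursive function below prints one indented line per element, with leaf text in quotes; the document string is a shortened, hypothetical version of the bibliography example.

```python
import xml.etree.ElementTree as ET

# Shortened hypothetical document.
DOC = (
    "<bibliography><book>"
    "<title>Foundations of Databases</title>"
    "<section><title>Introduction</title></section>"
    "</book></bibliography>"
)

def outline(elem, depth=0):
    """Return one indented line per element; text content is shown in quotes."""
    text = (elem.text or "").strip()
    label = elem.tag + (f' "{text}"' if text else "")
    lines = ["  " * depth + label]
    for child in elem:                      # children come back in document order
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(DOC))
print("\n".join(tree_lines))
```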
More XML features
• Processing instructions for apps: <? … ?>
• An XML file typically starts with a version declaration using this
syntax: <?xml version="1.0"?>
• Comments: <!-- Comments here -->
• CDATA section: <![CDATA[Tags: <book>,…]]>
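• IDs and references
<person id="o12"><name>Homer</name>…</person>
<person id="o34"><name>Marge</name>…</person>
<person id="o56" father="o12" mother="o34">
<name>Bart</name>…
</person>…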
• Namespaces allow external schemas and qualified names
<book xmlns:myCitationStyle="http://…/mySchema">
<myCitationStyle:title>…</myCitationStyle:title>
<myCitationStyle:author>…</myCitationStyle:author>…
</book>
• And more…
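Two of these features are easy to observe from a parser. In the sketch below (Python `xml.etree.ElementTree`), the namespace URI `http://example.org/mySchema` and the prefix `cs` are made-up placeholders. A qualified name is reported as `{namespace-URI}local-name`, and a CDATA section arrives as ordinary text with its markup characters left alone.

```python
import xml.etree.ElementTree as ET

# The namespace URI is a placeholder, not a real schema location.
DOC = """
<book xmlns:cs="http://example.org/mySchema">
  <cs:title>Foundations of Databases</cs:title>
  <title>an unqualified title</title>
</book>
"""
root = ET.fromstring(DOC)

# Qualified and unqualified names are distinct, even with the same local name.
tags = [child.tag for child in root]

# Lookups take a prefix-to-URI mapping; the prefix itself is arbitrary.
ns = {"cs": "http://example.org/mySchema"}
qualified_title = root.find("cs:title", ns).text

# Text inside a CDATA section is taken verbatim, "<" and ">" included.
cdata_text = ET.fromstring("<note><![CDATA[Tags: <book>, ...]]></note>").text
```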
Valid XML documents
• A valid XML document conforms to a
Document Type Definition (DTD)
• A DTD is optional
• A DTD specifies a grammar for the document
• Constraints on structures and values of elements, attributes, etc.
• Example
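<!DOCTYPE bibliography [
  <!ELEMENT bibliography (book+)>
  <!ELEMENT book (title, author*, publisher?, year?, section*)>
  <!ATTLIST book ISBN ID #REQUIRED>
  <!ATTLIST book price CDATA #IMPLIED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
  <!ELEMENT i (#PCDATA)>
  <!ELEMENT content (#PCDATA|i)*>
  <!ELEMENT section (title, content?, section*)>
]>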
DTD explained
<!DOCTYPE bibliography [
  bibliography is the root element of the document
<!ELEMENT bibliography (book+)>
  “+” means one or more: bibliography consists of a sequence of one or more book elements
<!ELEMENT book (title, author*, publisher?, year?, section*)>
  “?” means zero or one; “*” means zero or more
  book consists of a title, zero or more authors, an optional publisher, an optional year, and zero or more sections, in sequence
<!ATTLIST book ISBN ID #REQUIRED>
  book has a required ISBN attribute which is a unique identifier
<!ATTLIST book price CDATA #IMPLIED>
  book has an optional (#IMPLIED) price attribute which contains character data
  Example: <book ISBN="ISBN-10" price="80.00">
Other attribute types include IDREF (reference to an ID), IDREFS (space-separated list of references), enumerated list, etc.
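A DTD content model is essentially a regular expression over child element names. The sketch below makes that concrete for the book declaration, assuming the model is (title, author*, publisher?, year?, section*) with a required ISBN attribute. It is a hand-rolled check for illustration only, not a DTD validator; the standard-library parser does not validate against DTDs, and the helper names are invented here.

```python
import re
import xml.etree.ElementTree as ET

# (title, author*, publisher?, year?, section*) as a regular expression over
# child tag names, each name followed by exactly one space.
BOOK_MODEL = re.compile(r"(title )(author )*(publisher )?(year )?(section )*")

def matches_book_model(book):
    """Check the sequence of child tags of a <book> against BOOK_MODEL."""
    child_tags = "".join(child.tag + " " for child in book)
    return BOOK_MODEL.fullmatch(child_tags) is not None

def has_required_isbn(book):
    """Mimic a #REQUIRED attribute: ISBN must be present on the element."""
    return "ISBN" in book.attrib

# Hypothetical test elements.
good = ET.fromstring(
    '<book ISBN="x1"><title>T</title><author>A</author><year>1995</year></book>'
)
bad_order = ET.fromstring('<book ISBN="x2"><author>A</author><title>T</title></book>')
no_isbn = ET.fromstring("<book><title>T</title></book>")
```

A real validator would also check that ID values are unique across the whole document and that IDREF values point at existing IDs.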
DTD explained (cont’d)
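<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT i (#PCDATA)>
  title, author, publisher, year, and i contain parsed character data
  PCDATA is text that will be parsed
  • &lt; etc. will be parsed as entities
  • Use a CDATA section to include text verbatim
<!ELEMENT content (#PCDATA|i)*>
  content contains mixed content: text optionally interspersed with i elements
<!ELEMENT section (title, content?, section*)>
  Recursive declaration: each section begins with a title, followed by an optional content, and then zero or more (sub)sections
]>
Example:
<section><title>Introduction</title>
  <content>In this section we introduce the notion of <i>semi-structured data</i>…
  </content>
  <section><title>XML</title>
    <content>XML stands for…</content>
  </section>
  <section><title>DTD</title>
    <section><title>Definition</title>
      <content>DTD stands for…</content>
    </section>
    <section><title>Usage</title>
      <content>You can use DTD to…</content>
    </section>
  </section>
</section>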
Using DTD
• DTD can be included in the XML source file
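<?xml version="1.0"?>
<!DOCTYPE bibliography [
  … …
]>
<bibliography>… …</bibliography>
• DTD can be external
<?xml version="1.0"?>
<!DOCTYPE bibliography SYSTEM "../dtds/bib.dtd">
<bibliography>… …</bibliography>

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
… …
</html>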
Annoyance: content grammar
• Consider this declaration:
<!ELEMENT pub-venue
  ( (name, address, month, year) |
    (name, volume, number, year) )>
• “|” means “or”
• Syntactically legal, but won’t work
• Because of SGML compatibility issues
• When looking at name, a parser would not know which
way to go without looking further ahead
• Requirement: content declaration must be
“deterministic” (i.e., no look-ahead required)
• Can we rewrite it into an equivalent, deterministic one?
• Also, you cannot nest mixed content declarations
• Illegal: <!ELEMENT section (title, (#PCDATA|i)*, section*)>
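One possible answer to the rewriting question, sketched with regular expressions. It assumes the declaration is a choice between (name, address, month, year) and (name, volume, number, year). Factoring out the shared prefix and suffix gives (name, ((address, month) | (volume, number)), year), where the element after name decides the branch. The code checks only that the two models accept the same sample sequences; it does not prove determinism.

```python
import re

def seq(*names):
    """Encode a sequence of child element names, one space after each name."""
    return "".join(name + " " for name in names)

# Original model: both alternatives begin with `name`.
ORIGINAL = re.compile(r"(name address month year )|(name volume number year )")
# Candidate rewrite: common prefix and suffix factored out of the choice.
REWRITTEN = re.compile(r"name ((address month )|(volume number ))year ")

SAMPLES = [
    seq("name", "address", "month", "year"),   # first alternative
    seq("name", "volume", "number", "year"),   # second alternative
    seq("name", "address", "number", "year"),  # mixes the two: invalid
    seq("name", "year"),                       # middle part missing: invalid
    seq("name", "volume", "number"),           # year missing: invalid
]

accepted_by_original = [ORIGINAL.fullmatch(s) is not None for s in SAMPLES]
accepted_by_rewrite = [REWRITTEN.fullmatch(s) is not None for s in SAMPLES]
```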
Annoyance: element name clash
• Suppose we want to represent book titles and
section titles differently
• Book titles are pure text: (#PCDATA)
• Section titles can have formatting tags:
(#PCDATA|i|b|math)*
• But DTD only allows one title declaration!
• Workaround: rename as book-title and
section-title?
• Not nice: why can’t we just infer a title’s context?
Annoyance: lack of type support
• Too few attribute types: string (CDATA), token (e.g.,
ID, IDREF), enumeration (e.g., (red|green|blue))
• What about integer, float, date, etc.?
• ID not typed
• No two elements can have the same id, even if they have
different types (e.g., book vs. section)
• Difficult to reuse complex structure definitions
• E.g.: already defined element E1 as (blah, bleh,
foo?, bar*, …); want to define E2 to have the same
structure
• Parameter entities in DTD provide a workaround
• <!ENTITY % E.struct '(blah, bleh, foo?, bar*, …)'>
• <!ELEMENT E1 %E.struct;>
• <!ELEMENT E2 %E.struct;>
• Something less “hacky”?
XML Schema
• A more powerful way of defining the structure and
constraining the contents of XML documents
• An XML Schema definition is itself an XML
document
• Typically stored as a standalone .xsd file
• XML (data) documents refer to external .xsd files
• W3C recommendation
• Unlike DTD, XML Schema is separate from the XML
specification
XML Schema definition (XSD)
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  … …
</xs:schema>
• The xmlns:xs attribute defines xs to be the namespace described in the URL
• Uses of xs: within the xs:schema element now refer to tags from this namespace
XSD example
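<!-- We are now defining an element named book -->
<xs:element name="book">
  <!-- Declares a structure with child elements/attributes (as opposed to just text) -->
  <xs:complexType>
    <!-- Declares a sequence of child elements, like “(…, …, …)” in DTD -->
    <xs:sequence>
      <!-- A leaf element with string content -->
      <xs:element name="title" type="xs:string"/>
      <!-- Like author* in DTD -->
      <xs:element name="author" type="xs:string"
                  minOccurs="0" maxOccurs="unbounded"/>
      <!-- Like publisher? in DTD -->
      <xs:element name="publisher" type="xs:string"
                  minOccurs="0" maxOccurs="1"/>
      <!-- A leaf element with integer content -->
      <xs:element name="year" type="xs:integer"
                  minOccurs="0" maxOccurs="1"/>
      <!-- Like section* in DTD; section is defined elsewhere -->
      <xs:element ref="section"
                  minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <!-- Declares an attribute under book; this attribute is required -->
    <xs:attribute name="ISBN" type="xs:string" use="required"/>
    <!-- This attribute has a decimal value, and it is optional -->
    <xs:attribute name="price" type="xs:decimal" use="optional"/>
  </xs:complexType>
</xs:element>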
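Because an XSD is itself an XML document, it can be read with an ordinary XML parser. The sketch below uses `xml.etree.ElementTree`; the XSD string is a trimmed version of the book declaration and the helper names are invented for illustration. It translates each minOccurs/maxOccurs pair back into the DTD suffix it corresponds to, recovering a DTD-style content model.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"   # namespace prefix in ElementTree form

XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="publisher" type="xs:string" minOccurs="0" maxOccurs="1"/>
        <xs:element name="year" type="xs:integer" minOccurs="0" maxOccurs="1"/>
        <xs:element ref="section" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def dtd_occurrence(min_occurs, max_occurs):
    """Map an XSD occurrence range to a DTD suffix ("", "?", "*", "+")."""
    table = {("1", "1"): "", ("0", "1"): "?",
             ("0", "unbounded"): "*", ("1", "unbounded"): "+"}
    # Ranges such as 2..5 have no DTD symbol; show them as {min,max} instead.
    return table.get((min_occurs, max_occurs), f"{{{min_occurs},{max_occurs}}}")

def sequence_as_dtd(xsd_text, element_name):
    """Render the xs:sequence of a named element as a DTD-style content model."""
    schema = ET.fromstring(xsd_text)
    decl = schema.find(f"{XS}element[@name='{element_name}']")
    parts = []
    for child in decl.find(f"{XS}complexType/{XS}sequence"):
        name = child.get("name") or child.get("ref")
        # Both attributes default to 1 when omitted.
        parts.append(name + dtd_occurrence(child.get("minOccurs", "1"),
                                           child.get("maxOccurs", "1")))
    return "(" + ", ".join(parts) + ")"

book_model = sequence_as_dtd(XSD, "book")
print(book_model)
```

The mapping goes one way only: types such as xs:integer and ranges such as minOccurs="2" have no DTD counterpart.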
XSD example cont’d
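<xs:element name="section">
  <xs:complexType>
    <xs:sequence>
      <!-- Another title definition; can be different from book/title -->
      <xs:element name="title" type="xs:string"/>
      <xs:element name="content" minOccurs="0" maxOccurs="1">
        <!-- Declares mixed content (text interspersed with structure below) -->
        <xs:complexType mixed="true">
          <!-- A compositor like xs:sequence; this one declares a list of
               alternatives, like “(…|…|…)” in DTD -->
          <!-- minOccurs/maxOccurs can be attached to compositors too;
               the result is like (#PCDATA|i|b)* in DTD -->
          <xs:choice minOccurs="0" maxOccurs="unbounded">
            <xs:element name="i" type="xs:string"/>
            <xs:element name="b" type="xs:string"/>
          </xs:choice>
        </xs:complexType>
      </xs:element>
      <!-- Recursive definition -->
      <xs:element ref="section" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>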
XSD example cont’d
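• To complete bib.xsd:
<xs:element name="bibliography">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="book" minOccurs="1" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
• To use bib.xsd in an XML document:
<?xml version="1.0"?>
<bibliography xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:noNamespaceSchemaLocation="file:bib.xsd">
  <book>… …</book>
  <book>… …</book>
  … …
</bibliography>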
Named types
• Define once:
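<xs:complexType name="formattedTextType" mixed="true">
  <xs:choice minOccurs="0" maxOccurs="unbounded">
    <xs:element name="i" type="xs:string"/>
    <xs:element name="b" type="xs:string"/>
  </xs:choice>
</xs:complexType>
• Use elsewhere in XSD:
… …
<xs:element name="title" type="formattedTextType"/>
<xs:element name="content" type="formattedTextType"
            minOccurs="0" maxOccurs="1"/>
… …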
Restrictions
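<xs:simpleType name="priceType">
  <xs:restriction base="xs:decimal">
    <xs:minInclusive value="0.00"/>
  </xs:restriction>
</xs:simpleType>
<xs:simpleType name="statusType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="in stock"/>
    <xs:enumeration value="out of stock"/>
    <xs:enumeration value="out of print"/>
  </xs:restriction>
</xs:simpleType>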
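In plain code, these two restrictions amount to a range check and a membership check. The sketch below assumes the first type restricts xs:decimal with a minimum of 0.00 and the second restricts xs:string to the values "in stock", "out of stock", and "out of print". The function names are invented, and the decimal pattern is a simplified version of the xs:decimal lexical form.

```python
import re
from decimal import Decimal

# xs:decimal lexical form, simplified: optional sign, digits, optional fraction.
# No exponent notation, so "1e3" is rejected.
DECIMAL_FORM = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")

STATUS_VALUES = {"in stock", "out of stock", "out of print"}

def is_valid_price(text):
    """Restriction on xs:decimal with minInclusive 0.00."""
    text = text.strip()
    if DECIMAL_FORM.fullmatch(text) is None:
        return False
    return Decimal(text) >= Decimal("0.00")

def is_valid_status(text):
    """Restriction on xs:string by enumeration: the value must be listed."""
    return text in STATUS_VALUES
```

Checks like these are what a DTD cannot express, since its attribute types stop at strings, tokens, and enumerations.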
Keys
<xs:element name="bibliography">
  <xs:complexType>… …</xs:complexType>
  <xs:key name="bookKey">
    <xs:selector xpath="./book"/>
    <xs:field xpath="@ISBN"/>
  </xs:key>
</xs:element>
• Under any bibliography element, elements
reachable by selector “./book” (i.e., book child
elements) must have unique values for field “@ISBN”
(i.e., ISBN attributes)
• In general, a key can consist of multiple fields (multiple
<xs:field> elements under <xs:key>)
• More on XPath in next lecture
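The key constraint can be paraphrased as a small check: evaluate the selector under the scoping element, read the field(s) from each selected element, and require every value to be present and unique. This is a rough sketch with `xml.etree.ElementTree`, assuming the selector is "./book" and the field is the ISBN attribute; the function name and sample documents are invented, and field names are passed without the leading "@".

```python
import xml.etree.ElementTree as ET

def key_violations(scope, selector, fields):
    """Return key tuples that are incomplete or that repeat an earlier key.

    scope    -- the element on which the key is declared
    selector -- path, relative to scope, to the elements that carry the key
    fields   -- attribute names making up the key (one or more)
    """
    seen, bad = set(), []
    for elem in scope.findall(selector):
        key = tuple(elem.get(name) for name in fields)
        if None in key or key in seen:      # a key field must exist and be unique
            bad.append(key)
        else:
            seen.add(key)
    return bad

ok_doc = ET.fromstring('<bibliography><book ISBN="a"/><book ISBN="b"/></bibliography>')
bad_doc = ET.fromstring(
    '<bibliography><book ISBN="a"/><book ISBN="b"/><book ISBN="a"/><book/></bibliography>'
)

ok_result = key_violations(ok_doc, "./book", ["ISBN"])
bad_result = key_violations(bad_doc, "./book", ["ISBN"])
```

Passing several names in `fields` corresponds to a multi-field key.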
Foreign keys
• Suppose content can reference books
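<xs:element name="content">
  <xs:complexType mixed="true">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="i" type="xs:string"/>
      <xs:element name="b" type="xs:string"/>
      <xs:element name="book-ref">
        <xs:complexType>
          <xs:attribute name="ISBN" type="xs:string"/>
        </xs:complexType>
      </xs:element>
    </xs:choice>
  </xs:complexType>
</xs:element>

<xs:element name="bibliography">
  <xs:complexType>… …</xs:complexType>
  <xs:key name="bookKey">
    <xs:selector xpath="./book"/>
    <xs:field xpath="@ISBN"/>
  </xs:key>
  <xs:keyref name="bookForeignKey" refer="bookKey">
    <xs:selector xpath=".//book-ref"/>
    <xs:field xpath="@ISBN"/>
  </xs:keyref>
</xs:element>
• Under bibliography, for elements reachable by
selector “.//book-ref” (i.e., any book-ref element
underneath), values for field “@ISBN” (i.e., ISBN attributes)
must appear as values of bookKey, the key referenced
• Make sure keyref is declared in the same scope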
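A foreign key adds one more step to the key check: collect the key values under the scoping element, then require every referencing value to be among them. The sketch assumes the referenced elements are selected by "./book", the referencing elements by ".//book-ref", and both carry an ISBN attribute. The function name and sample document are invented.

```python
import xml.etree.ElementTree as ET

def dangling_refs(scope, key_selector, ref_selector, field):
    """Return referencing values that match no key value under `scope`."""
    keys = {e.get(field) for e in scope.findall(key_selector)}
    dangling = []
    for ref in scope.findall(ref_selector):
        value = ref.get(field)
        # A reference without the field is not checked, only present values are.
        if value is not None and value not in keys:
            dangling.append(value)
    return dangling

DOC = ET.fromstring("""
<bibliography>
  <book ISBN="b1">
    <section><title>t</title><content>see <book-ref ISBN="b2"/></content></section>
  </book>
  <book ISBN="b2"/>
  <book ISBN="b3">
    <section><title>t</title><content>see <book-ref ISBN="b9"/></content></section>
  </book>
</bibliography>
""")

missing = dangling_refs(DOC, "./book", ".//book-ref", "ISBN")
```

Both selectors are evaluated from the same scoping element, which is why the key and the reference to it need to be declared in the same scope.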
Why use DTD or XML Schema?
• Benefits of not using them
• Unstructured data is easy to represent
• Overhead of validation is avoided
• Benefits of using them
• Serve as schema for the XML data
• Guard against errors
• Help with processing
• Facilitate information exchange
• People can agree to use a common DTD or XML Schema to
exchange data (e.g., XHTML)
XML versus relational data
Relational data
• Schema is always fixed in advance and difficult to change
• Simple, flat table structures
• Ordering of rows and columns is unimportant
• Exchange is problematic
• “Native” support in all serious commercial DBMS
XML data
• Well-formed XML does not require predefined, fixed schema
• Nested structure; ID/IDREF(S) permit arbitrary graphs
• Ordering forced by document format; may or may not be important
• Designed for easy exchange
• Often implemented as an “add-on” on top of relations
Case study
• Design an XML document representing cities,
counties, and states
• For states, record name and capital (city)
• For counties, record name, area, and location (state)
• For cities, record name, population, and location (county
and state)
• Assume the following:
• Names of states are unique
• Names of counties are only unique within a state
• Names of cities are only unique within a county
• A city is always located in a single county
• A county is always located in a single state
A possible design
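geo_db
  state (attributes: name, capital_city_id) …
    county (attributes: name, area) …
      city (attributes: id, name, population) …
• Declare stateKey in geo_db with
Selector ./state
Field @name
• Declare countyInStateKey in state with
Selector ./county
Field @name
• Declare cityInCountyKey in county with
Selector ./city
Field @name
• Declare cityIdKey in geo_db with
Selector ./state/county/city
Field @id
• Declare capitalCityIdKeyRef in geo_db referencing cityIdKey, with
Selector ./state
Field @capital_city_id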
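The design can be tried out on a toy document. The sketch assumes the nesting geo_db / state / county / city, with name attributes on every level, an id attribute on city, and a capital_city_id attribute on state. All names and numbers in the sample are invented, not real places or statistics. Each constraint is checked in the scope where it would be declared: county names are compared only within one state, city names only within one county.

```python
import xml.etree.ElementTree as ET

# Invented sample data. "North" and "Springfield" repeat on purpose, but never
# inside the same parent, so the scoped keys still hold.
GEO = """
<geo_db>
  <state name="A" capital_city_id="c1">
    <county name="North" area="100">
      <city id="c1" name="Springfield" population="1000"/>
    </county>
    <county name="South" area="200">
      <city id="c2" name="Springfield" population="2000"/>
    </county>
  </state>
  <state name="B" capital_city_id="c3">
    <county name="North" area="300">
      <city id="c3" name="Riverton" population="3000"/>
    </county>
  </state>
</geo_db>
"""

def has_duplicates(elements, field):
    """True if two of the elements share the same value for attribute `field`."""
    values = [e.get(field) for e in elements]
    return len(values) != len(set(values))

def violated_constraints(db):
    """Return the names of the constraints that the document breaks."""
    problems = []
    states = db.findall("./state")
    if has_duplicates(states, "name"):
        problems.append("stateKey")
    for state in states:                                   # scope: one state
        counties = state.findall("./county")
        if has_duplicates(counties, "name"):
            problems.append("countyInStateKey")
        for county in counties:                            # scope: one county
            if has_duplicates(county.findall("./city"), "name"):
                problems.append("cityInCountyKey")
    cities = db.findall("./state/county/city")             # scope: whole database
    if has_duplicates(cities, "id"):
        problems.append("cityIdKey")
    city_ids = {c.get("id") for c in cities}
    if any(s.get("capital_city_id") not in city_ids for s in states):
        problems.append("capitalCityIdKeyRef")
    return problems

good_result = violated_constraints(ET.fromstring(GEO))
broken_result = violated_constraints(
    ET.fromstring(GEO.replace('capital_city_id="c3"', 'capital_city_id="zz"'))
)
```

The foreign key only guarantees that a capital is some city in the database; it does not force the capital to lie inside its own state.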