XML Basics: Structure and Syntax Guide

The document provides an overview of XML technology, detailing its purpose, syntax, and structure, emphasizing its role in data representation and interchange. It explains the importance of well-formed XML documents, validation against schemas like DTD and XSD, and the differences between structured, semi-structured, and unstructured data. Additionally, it includes practical examples and tasks for creating XML documents, highlighting XML's flexibility and usability in various applications.

Uploaded by

jeviskang
Copyright
© All Rights Reserved

CEB408: XML and Document Content Description

Lecture Notes

Lecturer: [Link] Ajebua

Module 1: Foundations of XML Technology


1.1 Introduction to XML

Extensible Markup Language (XML) is a markup language standardized by the World Wide Web
Consortium (W3C) [1]. Similar in appearance to HyperText Markup Language (HTML), XML
differs fundamentally in its purpose and flexibility. While HTML uses predefined tags
primarily for presenting information in web browsers, XML allows users to define their own
tags, making it a meta-language capable of defining other specialized markup languages [1].
Examples of XML-based languages include XHTML, MathML, SVG, RSS, and RDF [3]. Its primary
function is not presentation but rather the description and structuring of data [1].

1.2 Purpose and Role of XML

The core purpose of XML is to provide a standardized, text-based format for representing
data in a way that is both human-readable and machine-readable [4]. Its extensibility,
achieved through user-defined tags, allows it to describe a wide variety of data structures
specific to different domains or applications [3]. This makes XML a powerful tool for
storing, searching, and, crucially, sharing data across different systems and platforms [3].
Because the fundamental syntax is standardized by the W3C [2], systems can exchange XML
data reliably, knowing that any conforming parser on the receiving side will be able to
read it [3]. This interoperability makes XML highly suitable for data interchange over the
internet, such as in web services where it can be used to structure requests and
responses [1].

1.3 Basic XML Syntax

XML documents adhere to a specific set of syntax rules that parallel those of HTML but are
enforced more strictly [3].

• Elements: The fundamental building blocks are elements, marked by start tags
(e.g., <book>) and end tags (e.g., </book>). The content resides between these
tags [3]. Elements can be nested hierarchically to represent data relationships [3]. Empty
elements, which have no content, can use a self-closing tag format (e.g., <image/>) [3].

• Attributes: Elements can possess attributes within their start tag to provide
additional metadata. Attributes consist of a name-value pair, with the value
enclosed in quotes (single or double). For example: <book category="fiction"> [3].

• Root Element: Every XML document must contain exactly one top-level element,
known as the root element, which encloses all other elements [3].

• XML Declaration (Prolog): While not strictly mandatory in all contexts, XML
documents typically begin with an XML declaration, specifying the XML version and
character encoding. Example: <?xml version="1.0" encoding="UTF-8"?> [3]. When present, it
must be the very first thing in the document. It looks like a processing instruction, but
the XML specification treats it as a distinct construct; either way, it is not a tag [3].

• Comments: Comments are written as <!-- comment text --> and are ignored by parsers [3].

• Case Sensitivity: Unlike HTML, XML is case-sensitive. Start and end tags must
match exactly in case (e.g., <Book> is different from <book>) [3].

• Nesting: Elements must be properly nested; overlapping tags are forbidden [3].
Correct: <parent><child></child></parent>.
Incorrect: <parent><child></parent></child>.
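
The rules above can be observed by handing small fragments to a conforming parser. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree module; the helper name try_parse and the sample strings are illustrative assumptions, not part of the course material.

```python
import xml.etree.ElementTree as ET

def try_parse(text):
    """Return the root element, or None if the parser rejects the text."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None

# Properly nested elements with a quoted attribute parse successfully.
book = try_parse('<book category="fiction"><title>Sample Title</title></book>')
print(book.tag, book.get("category"), book.find("title").text)

# Overlapping tags violate the nesting rule, so the parser rejects them.
print(try_parse("<parent><child></parent></child>"))

# XML is case-sensitive: <Book> cannot be closed by </book>.
print(try_parse("<Book></book>"))

# An empty element may use the self-closing form.
print(try_parse("<image/>").tag)
```

Unlike an HTML parser, the XML parser does not attempt to recover from these mistakes; it reports an error and stops.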

1.4 Structured vs. Semi-structured Data

Data can be broadly categorized based on its organization:

• Structured Data: This data adheres to a rigid, predefined schema, typically
organized into tables with rows and columns, as found in relational databases
(RDBMS) or spreadsheets [10]. Each column has a specific data type, and
relationships are clearly defined [14]. It is easily queried using languages like SQL [14].

• Unstructured Data: This data lacks a specific, predefined organizational structure,
making it difficult to process and analyze using traditional tools. Examples include
plain text documents, images, videos, and audio files [12].

• Semi-structured Data: This category falls between structured and unstructured
data [6]. It does not conform to the strict tabular structure of relational databases but
contains organizational properties like tags, markers, or metadata that facilitate
parsing and analysis [6]. Key characteristics include:

  o Flexibility: Accommodates variations in structure and allows schema evolution
  over time without major disruption [10]. Entities of the same class might have
  different attributes [6].

  o Self-Descriptive: Often includes tags or metadata within the data itself (like XML
  tags or JSON key-value pairs) that provide context about the content and
  structure [6].

  o Hierarchical Structure: Frequently uses nested or hierarchical structures to
  represent complex relationships [10].

XML is a prime example of semi-structured data [6]. Its use of tags provides organization
and self-description, making it more structured than plain text. However, its flexibility in
defining tags and structures means it doesn't adhere to the rigid schema requirements
of traditional structured data found in RDBMS [10]. This flexibility makes XML suitable for
exchanging diverse data on the internet and representing data where the structure
might vary or evolve [6].

The rise of semi-structured data, driven by web applications, APIs, mobile devices, and
IoT, necessitates data platforms capable of handling both structured and
semi-structured formats effectively for business intelligence and analytics [11].
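
To make the three characteristics concrete, here is a small sketch in Python (standard library only). The <people> document and its field names are hypothetical sample data. Two records of the same kind carry different fields, the tags describe the values, and the nesting expresses which values belong to which record.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: two <person> records with different sets of fields.
DATA = """<people>
  <person><name>Ada</name><email>ada@example.com</email></person>
  <person><name>Lin</name><phone>555-0100</phone><phone>555-0101</phone></person>
</people>"""

records = []
for person in ET.fromstring(DATA):
    record = {}
    for field in person:
        # The tag name says what the value means (self-description);
        # repeated tags are collected into a list (flexible structure).
        record.setdefault(field.tag, []).append(field.text)
    records.append(record)

for record in records:
    print(record)
```

A relational table would need a fixed set of columns for both rows; here each record simply carries the fields it has.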

1.5 Well-Formed XML Documents

For an XML document to be processed correctly by an XML parser, it must be
well-formed. This means it strictly adheres to all fundamental XML syntax rules [3]. The key
requirements for well-formedness are:

1. Single Root Element: The document must contain exactly one root element [3].

2. Matching Tags: Every start tag must have a corresponding end tag, or the element
must be written using the self-closing tag syntax [3].

3. Proper Nesting: Elements must be nested correctly without any overlap [3].

4. Attribute Values Quoted: All attribute values must be enclosed in either single or
double quotes [3].

5. Case Sensitivity: Element and attribute names are case-sensitive; start and end
tags must match case [3].

6. Special Characters: The characters < and & must be escaped as &lt; and &amp; whenever
they appear as literal text in element content or attribute values rather than as part of
markup. The other predefined entity references (&gt;, &quot;, &apos;) escape >, ", and ';
a quote character must be escaped when it appears inside an attribute value delimited by
that same quote [3].

A document that fails any of these rules is not well-formed and will typically cause an
error during parsing [3]. Well-formedness is the baseline requirement for any XML
document. Further validation against a schema (like DTD or XSD) checks for structural
and content validity, which is discussed in Module 2 [3].
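
A well-formedness check amounts to asking a conforming parser whether it accepts the text. The sketch below uses Python's standard library; the function name is_well_formed and the sample fragments are illustrative assumptions. It also shows xml.sax.saxutils.escape and quoteattr, which apply the escaping from rule 6.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def is_well_formed(text):
    """True if the parser accepts the text as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b/></a>"))        # one root, every tag closed
print(is_well_formed("<a/><b/>"))           # two root elements
print(is_well_formed("<a><b></a></b>"))     # overlapping tags
print(is_well_formed("<a x=1/>"))           # unquoted attribute value
print(is_well_formed("<a>AT&T</a>"))        # unescaped ampersand
print(is_well_formed("<a>AT&amp;T</a>"))    # escaped ampersand

# escape() replaces &, < and > in text content;
# quoteattr() escapes an attribute value and wraps it in suitable quotes.
raw = "5 < 6 & 7 > 2"
element = "<expr note=%s>%s</expr>" % (quoteattr('say "hi"'), escape(raw))
print(element)

# Parsing the escaped text gives back the original characters.
parsed = ET.fromstring(element)
print(parsed.text == raw)
```

Only the first and last calls to is_well_formed return True; each of the others breaks one of the numbered rules.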

1.6 Simple XML Document Examples

Example 1: A Single Note

XML

<?xml version="1.0" encoding="UTF-8"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

Description: This document represents a simple note. <note> is the root element,
containing child elements <to>, <from>, <heading>, and <body>, each holding text
content. It includes an XML declaration [3].

Example 2: A Book Representation

XML

<?xml version="1.0" encoding="UTF-8"?>
<book category="Science Fiction" isbn="978-0345391803">
  <title lang="en">The Hitchhiker's Guide to the Galaxy</title>
  <author>Douglas Adams</author>
  <year>1979</year>
  <publisher>Pan Books</publisher>
</book>

Description: This document describes a book. <book> is the root element and includes
attributes category and isbn. It contains child elements like <title> (which itself has a
lang attribute), <author>, <year>, and <publisher> [4].

Example 3: Empty Element

XML

<?xml version="1.0" encoding="UTF-8"?>
<product>
  <name>Laptop</name>
  <instock available="true"/>
</product>

Description: This example shows an empty element <instock> using the self-closing tag
syntax. It also includes an attribute available [3].

These examples illustrate the basic structure, nesting, attributes, and text content
common in XML documents.
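
To show how a program sees such a document, the following sketch reads the book example with Python's standard-library ElementTree. The variable names are illustrative; note that every value arrives as text, so converting the year to a number is the application's job.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0" encoding="UTF-8"?>
<book category="Science Fiction" isbn="978-0345391803">
  <title lang="en">The Hitchhiker's Guide to the Galaxy</title>
  <author>Douglas Adams</author>
  <year>1979</year>
  <publisher>Pan Books</publisher>
</book>"""

# Parse from bytes so the encoding named in the declaration is honoured.
root = ET.fromstring(BOOK_XML.encode("utf-8"))

print(root.tag)                          # name of the root element
print(root.attrib)                       # attributes as a dictionary
print(root.find("title").get("lang"))    # attribute of a child element

# Child elements are visited in document order.
for child in root:
    print(child.tag, "=", child.text)

# Element content is always text; convert it explicitly when needed.
year = int(root.findtext("year"))
print(year + 1)
```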

1.7 Module Insights & Connections

XML's design principle of separating data structure and content from presentation
(unlike HTML, which mixes them [1]) is a cornerstone of its utility. This separation allows
the same XML data source to be presented in multiple ways (e.g., styled with CSS for
web display [3], transformed with XSLT into HTML or PDF [3], or processed by backend
applications) without altering the underlying data. This promotes data reusability and
simplifies maintenance, as changes to presentation logic do not require changes to the
data structure itself.

Furthermore, the concept of well-formedness provides a baseline guarantee of syntactic
correctness [3]. This strict requirement, enforced by all conforming XML parsers, ensures
a level of predictability and reliability when exchanging data between systems. While
potentially perceived as less forgiving than HTML's error handling [19], this strictness is
crucial for data integrity and automated processing, forming the foundation upon which
more complex validation mechanisms (DTDs, XSDs) are built.

1.8 Practical Idea: Create a Simple Contacts XML

Task: Create a well-formed XML file named [Link].

1. Include an XML declaration.
2. Define a root element named <contacts>.
3. Inside <contacts>, add at least two <contact> elements.
4. Each <contact> element should have a unique id attribute (e.g., id="c001").
5. Each <contact> element should contain the following child elements:
   o <name> (containing the contact's full name)
   o <email> (containing the contact's email address)
   o At least one <phone> element (containing a phone number).
6. Each <phone> element should have a type attribute indicating the phone type
   (e.g., type="mobile", type="work").
7. Ensure the document is well-formed (correct nesting, closing tags, quoted
   attributes).

Example Structure Snippet:

XML

<?xml version="1.0" encoding="UTF-8"?>
<contacts>
  <contact id="c001">
    <name>Alice Wonderland</name>
    <email>alice@[Link]</email>
    <phone type="mobile">123-456-7890</phone>
  </contact>
  <contact id="c002">
    <name>Bob The Builder</name>
    <email>bob@[Link]</email>
    <phone type="work">987-654-3210</phone>
    <phone type="mobile">555-555-5555</phone>
  </contact>
</contacts>
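
One way to check a finished file against the task list is a short script. The sketch below is an assumption-laden illustration, not a required part of the exercise: the function name check_contacts is hypothetical, and the example.com addresses stand in for the email domains that are not legible in these notes. Well-formedness (step 7) is checked implicitly, because parsing fails otherwise.

```python
import xml.etree.ElementTree as ET

CONTACTS_XML = """<contacts>
  <contact id="c001">
    <name>Alice Wonderland</name>
    <email>alice@example.com</email>
    <phone type="mobile">123-456-7890</phone>
  </contact>
  <contact id="c002">
    <name>Bob The Builder</name>
    <email>bob@example.com</email>
    <phone type="work">987-654-3210</phone>
    <phone type="mobile">555-555-5555</phone>
  </contact>
</contacts>"""

def check_contacts(xml_text):
    """Return a list of problems; an empty list means the task rules hold."""
    problems = []
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    if root.tag != "contacts":
        problems.append("root element must be <contacts>")
    contacts = root.findall("contact")
    if len(contacts) < 2:
        problems.append("need at least two <contact> elements")
    ids = [c.get("id") for c in contacts]
    if None in ids or len(set(ids)) != len(ids):
        problems.append("every <contact> needs a unique id attribute")
    for c in contacts:
        label = c.get("id") or "(no id)"
        for required in ("name", "email", "phone"):
            if c.find(required) is None:
                problems.append(f"{label}: missing <{required}>")
        for phone in c.findall("phone"):
            if phone.get("type") is None:
                problems.append(f"{label}: <phone> lacks a type attribute")
    return problems

print(check_contacts(CONTACTS_XML))
```

Hand-written checks like this grow quickly; Module 2 introduces schemas, which express the same rules declaratively.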

Module 2: Modeling and Validation of XML Documents


2.1 Introduction to Validation (Well-Formed vs. Valid)

While Module 1 established that an XML document must be well-formed to be
syntactically correct [3], this only guarantees adherence to basic XML rules. It does not
ensure that the document follows a specific structure or uses appropriate content for a
particular application.

This leads to the concept of a valid XML document. A valid XML document is one that is
not only well-formed but also conforms to the rules defined by a specific schema, such
as a Document Type Definition (DTD) or an XML Schema Definition (XSD) [3]. These
schemas define the legal building blocks of an XML document: the allowed elements,
attributes, their order, nesting, data types (more so in XSD), and occurrences [9].

Validation ensures that the XML data is structured consistently and correctly according
to predefined business rules or data exchange agreements [23].

2.2 Document Type Definition (DTD)

DTDs represent an older, established mechanism for defining the structure of an XML
document, originating from SGML [22]. They define the legal elements, attributes, and their
arrangement [9].

• Purpose: To define the grammatical structure of an XML document, specifying
which elements and attributes are allowed and how they can be nested [9]. Parsers
can use a DTD to validate an XML document's conformity [22].

• Linking DTD to XML: A DTD is associated with an XML document via the
<!DOCTYPE> declaration, placed after the XML declaration [9].

  o Internal DTD: Declarations are embedded directly within the XML file inside
  square brackets [...] in the DOCTYPE declaration [9].

  o External DTD: Declarations reside in a separate .dtd file, referenced using
  SYSTEM "uri" or PUBLIC "public_id" "uri" in the DOCTYPE declaration [9].

• Syntax: DTDs use a unique, non-XML syntax [23].

  o <!ELEMENT element-name content-model>: Defines an element and its
  allowed content. Content models include:
    ▪ (#PCDATA): Parsed character data (text) [9].
    ▪ (child1, child2,...): A specific sequence of child elements [9].
    ▪ (child1 | child2): A choice between child elements.
    ▪ EMPTY: No content allowed [9].
    ▪ ANY: Any content allowed [9].
    ▪ Occurrence indicators: ? (zero or one), * (zero or more), + (one or more)
    can follow element names or groups [9].

  o <!ATTLIST element-name attr-name attr-type default-decl>: Defines
  attributes for a given element [9].
    ▪ attr-name: The name of the attribute.
    ▪ attr-type: Often CDATA (character data), but also includes ID, IDREF,
    IDREFS, NMTOKEN, NMTOKENS, or an enumerated list (val1|val2|...) [9].
    Data typing is very limited compared to XSD [22].
    ▪ default-decl: Specifies if the attribute is required (#REQUIRED), optional
    (#IMPLIED), has a fixed value (#FIXED "value"), or a default value
    ("value") [9].

  o <!ENTITY entity-name "entity-value">: Defines an internal entity (text
  substitution) [3].

  o <!ENTITY entity-name SYSTEM "uri">: Defines an external entity [9]. Note: Entity
  expansion can be exploited. Nested internal entities that expand exponentially are the
  basis of the "Billion Laughs" attack, a denial-of-service vulnerability, and external
  entities can be abused to make the parser read local files or remote resources (XML
  External Entity, or XXE, attacks) [28].

• Example (Internal DTD for Contacts):

XML

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE contacts [
  <!ELEMENT contacts (contact+)>
  <!ELEMENT contact (name, email, phone+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ATTLIST contact id CDATA #REQUIRED>
  <!ATTLIST phone type (mobile|work|home) #REQUIRED>
]>
<contacts>
  <contact id="c001">
    <name>Alice Wonderland</name>
    <email>alice@[Link]</email>
    <phone type="mobile">123-456-7890</phone>
  </contact>
  <contact id="c002">
    <name>Bob The Builder</name>
    <email>bob@[Link]</email>
    <phone type="work">987-654-3210</phone>
    <phone type="mobile">555-555-5555</phone>
  </contact>
</contacts>

(Syntax based on [9])

• Validation Example: The XML above, with its two <contact> elements, is valid against
the internal DTD. An XML validator would confirm this. If a <contact> element without an
id attribute were added, the document would become invalid, because the declaration
<!ATTLIST contact id CDATA #REQUIRED> makes that attribute mandatory. Similarly, using a
phone type other than "mobile", "work", or "home" would cause invalidity [3].

• Limitations: DTDs have significant limitations: they are not namespace-aware,
hindering their use in modular or mixed-vocabulary documents [22]. Their data typing
capabilities are very weak, mostly treating content as strings (#PCDATA) [22]. The
syntax is not XML-based, requiring separate parsing logic [23]. They offer less control
over cardinality and structure compared to XSD [23]. DTDs are generally considered
less powerful and flexible than XSD [24].
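
The content-model notation described above (sequences, choices, and the ?, *, + indicators) behaves like a regular expression over the names of an element's children. The sketch below illustrates that idea in Python for a hypothetical model (name, email, phone+). It is a hand-written stand-in for what a validating parser does, not a DTD validator, and the names CONTACT_MODEL and matches_model are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Regular expression standing in for the content model (name, email, phone+).
# Each child tag name is followed by one space in the string being matched.
CONTACT_MODEL = re.compile(r"name email (phone )+")

def matches_model(contact):
    """Check the order and number of a <contact> element's children."""
    sequence = "".join(child.tag + " " for child in contact)
    return CONTACT_MODEL.fullmatch(sequence) is not None

ok = ET.fromstring("<contact><name/><email/><phone/><phone/></contact>")
bad_order = ET.fromstring("<contact><email/><name/><phone/></contact>")
no_phone = ET.fromstring("<contact><name/><email/></contact>")

print(matches_model(ok))         # children follow the sequence
print(matches_model(bad_order))  # name and email are swapped
print(matches_model(no_phone))   # + requires at least one phone
```

The sketch also makes the typing limitation visible: the model constrains which children appear and in what order, but says nothing about whether an email or a phone number is sensibly formatted.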

2.3 XML Schema Definition (XSD)

XML Schema Definition (XSD), also known as XML Schema, is the W3C-recommended
standard for defining the structure, content, and data types of XML documents [9]. It
overcomes many limitations of DTDs and provides a much richer and more precise way
to model XML data [23].

• Purpose: To describe the structure and constrain the content of XML documents,
including detailed data typing and namespace support [23].

• Key Advantages over DTD:

  o Rich Data Types: Supports a wide range of built-in data types (e.g., xs:string,
  xs:integer, xs:decimal, xs:date, xs:boolean) and allows the creation of custom
  types derived from base types with restrictions (e.g., patterns, length, ranges) [9].

  o Namespace Aware: Fully supports XML namespaces, enabling schema
  modularity and preventing name conflicts when combining different XML
  vocabularies [22].

  o XML-Based Syntax: XSD schemas are themselves written in XML, making them
  processable by standard XML tools and potentially easier for XML developers
  to learn and integrate [9].

  o Extensibility: More extensible and provides greater control over XML structure
  and content rules [23].

  o Object-Oriented Concepts: Supports concepts like type inheritance (extension
  and restriction).
• Syntax: XSD uses XML syntax, typically using the xs prefix bound to the
http://www.w3.org/2001/XMLSchema namespace [30].

  o <xs:schema>: The root element of an XSD document. Key attributes include:
    ▪ xmlns:xs="http://www.w3.org/2001/XMLSchema": Declares the XML
    Schema namespace, conventionally mapped to the xs prefix [30].
    ▪ targetNamespace="uri": Specifies the namespace that the elements and
    types defined in this schema belong to [30].
    ▪ elementFormDefault="qualified|unqualified": Determines if locally declared
    elements must be namespace-qualified in instance documents (usually set
    to qualified; the default is unqualified) [32].
    ▪ attributeFormDefault="qualified|unqualified": Similar to elementFormDefault
    but for attributes (defaults to unqualified).

  o <xs:element name="..." type="...">: Declares an element. Attributes include:
    ▪ name: The name of the element [9].
    ▪ type: The data type of the element (e.g., xs:string, xs:integer, or a custom
    type name) [9].
    ▪ minOccurs: Minimum number of times the element can appear (default is
    1) [26].
    ▪ maxOccurs: Maximum number of times the element can appear (default is
    1, unbounded for unlimited) [26].
    ▪ Can contain an inline type definition (<xs:simpleType> or
    <xs:complexType>) instead of using the type attribute.

  o <xs:attribute name="..." type="..." use="...">: Declares an attribute [26]. Attributes
  include:
    ▪ name: The name of the attribute.
    ▪ type: The data type of the attribute.
    ▪ use: Specifies if the attribute is required, optional (default), or prohibited.
    ▪ default: Provides a default value.
    ▪ fixed: Provides a fixed value that cannot be changed in the instance.

  o <xs:simpleType name="...">: Defines a simple type, that is, text-only content with
  no child elements and no attributes. Often used with <xs:restriction> to
  constrain built-in types (e.g., using <xs:pattern>, <xs:minLength>,
  <xs:maxLength>, <xs:enumeration>) [20]. Text content that also carries attributes
  requires a complex type built with <xs:simpleContent>.

  o <xs:complexType name="...">: Defines a complex type, which can contain
  child elements and attributes [9]. Content models are defined using compositors:
    ▪ <xs:sequence>: Child elements must appear in the specified order [9].
    ▪ <xs:choice>: Only one of the specified child elements or groups can
    appear.
    ▪ <xs:all>: All specified child elements can appear once, in any order.
    ▪ Can define attributes using <xs:attribute> declarations directly within the
    complex type definition.

  o Namespaces: The targetNamespace defines the vocabulary being created by
  the schema. Instance documents using this vocabulary declare this namespace
  (often as the default namespace) and use xsi:schemaLocation to associate the
  namespace URI with the physical location of the XSD file [30]. The xs prefix is used
  within the schema to refer to built-in XSD elements and types [30].

• Example (XSD for Contacts - [Link]):

XML

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="[Link]"
           xmlns:tns="[Link]"
           elementFormDefault="qualified">

  <xs:element name="contacts">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="contact" type="tns:ContactType" minOccurs="1"
                    maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:complexType name="ContactType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="email" type="tns:EmailType"/>
      <xs:element name="phone" type="tns:PhoneType" minOccurs="1"
                  maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="required"/>
  </xs:complexType>

  <xs:complexType name="PhoneType">
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute name="type" type="tns:PhoneCategoryType" use="required"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <xs:simpleType name="PhoneCategoryType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="mobile"/>
      <xs:enumeration value="work"/>
      <xs:enumeration value="home"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="EmailType">
    <xs:restriction base="xs:string">
      <xs:pattern value=".+@.+\..+"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>

(Based on concepts from [9])

• Linking XSD to XML: The instance XML document uses attributes from the XML
Schema Instance namespace (xsi, typically http://www.w3.org/2001/XMLSchema-instance)
to link to the schema [30].

  o Declare the target namespace (often as the default namespace using xmlns).
  o Declare the xsi namespace prefix.
  o Use xsi:schemaLocation="namespaceURI schemaURL" to map the target
  namespace URI to the location of the XSD file.

XML

<?xml version="1.0" encoding="UTF-8"?>
<contacts xmlns="[Link]"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="[Link] [Link]">
  <contact id="c001">
    <name>Alice Wonderland</name>
    <email>alice@[Link]</email>
    <phone type="mobile">123-456-7890</phone>
  </contact>
  <contact id="c002">
    <name>Bob The Builder</name>
    <email>bob@example</email>
    <phone type="cell">987-654-3210</phone>
  </contact>
</contacts>

(Based on [32])

• Validation Example: When validating the above XML against [Link], the
first <contact> element would be valid. The second <contact> element would be
invalid due to two errors: the email format doesn't match the EmailType pattern,
and the phone type "cell" is not one of the allowed values ("mobile", "work", "home")
defined in PhoneCategoryType [20]. Validators (online tools, IDEs, libraries like lxml in
Python or JAXP/JAXB in Java) perform these checks [20].
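
To see why the second contact fails, the two constraints can be emulated by hand. The sketch below is not XSD validation: it uses only Python's standard library, which has no schema validator, and simply re-implements the EmailType pattern and the PhoneCategoryType enumeration as a regular expression and a set. Namespace declarations are omitted for brevity, and the first contact's email domain (example.com) is a placeholder.

```python
import re
import xml.etree.ElementTree as ET

EMAIL_PATTERN = re.compile(r".+@.+\..+")        # same expression as EmailType
PHONE_CATEGORIES = {"mobile", "work", "home"}   # same values as PhoneCategoryType

INSTANCE = """<contacts>
  <contact id="c001">
    <email>alice@example.com</email>
    <phone type="mobile">123-456-7890</phone>
  </contact>
  <contact id="c002">
    <email>bob@example</email>
    <phone type="cell">987-654-3210</phone>
  </contact>
</contacts>"""

def check(contact):
    """Return the constraint violations found in one <contact> element."""
    errors = []
    email = contact.findtext("email", default="")
    # An XSD pattern must match the whole value, hence fullmatch().
    if not EMAIL_PATTERN.fullmatch(email):
        errors.append(f"email {email!r} does not match the pattern")
    for phone in contact.findall("phone"):
        if phone.get("type") not in PHONE_CATEGORIES:
            errors.append(f"phone type {phone.get('type')!r} is not allowed")
    return errors

results = {c.get("id"): check(c) for c in ET.fromstring(INSTANCE)}
for contact_id, errors in results.items():
    print(contact_id, errors or "valid")
```

A real validator derives these checks from the schema file itself, which is the main practical benefit of declaring them in XSD rather than in application code.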

2.4 DTD vs. XSD Comparison

The choice between DTD and XSD depends on the complexity and requirements of the
XML application. While DTDs are simpler for basic structural validation, XSDs offer
significantly more power and flexibility, particularly for data-oriented applications
requiring type checking and namespace support. The following table summarizes the
key differences:

| Feature | DTD (Document Type Definition) | XSD (XML Schema Definition) |
|---|---|---|
| Syntax | Unique, non-XML syntax [23] | Uses XML syntax [9] |
| Data Types | Very limited (mostly #PCDATA/string) [22] | Rich built-in types (string, number, date, etc.) + custom types [22] |
| Namespaces | Not supported [22] | Fully supported [22] |
| Validation Power | Basic structure, element order, attributes | Structure, types, patterns, ranges, uniqueness, keys/keyrefs [23] |
| Extensibility | Limited (Parameter Entities) [26] | Highly extensible (import, include, redefine types) [23] |
| Readability/Maintainability | Can be compact but less readable for complex schemas | XML syntax can be verbose but potentially more readable [26] |
| Complexity | Simpler for basic tasks [23] | More complex due to richer feature set [23] |
| Tooling Support | Widely supported by older tools [29] | Excellent support in modern XML tools, parsers, IDEs [28] |
| Entities | Supports internal & external entities [9] | No direct equivalent; different mechanisms for reuse |
| Inline Declaration | Can be embedded within the XML file [22] | Typically external (.xsd file), linked via xsi:schemaLocation [27] |

(Table data synthesized from [9])

2.5 Module Insights & Connections

The transition from DTD's unique syntax to XSD's XML-based representation
represents a significant step towards unifying the XML ecosystem [23]. Because XSDs are
themselves XML documents, the same tools and programming libraries used to parse,
manipulate, and generate XML data can also be used to work with the schemas
themselves [27]. Having schemas in XML form facilitates automated schema processing,
generation of documentation, code generation based on schemas (as seen with JAXB
[36]), and easier validation of the schema documents themselves. This contrasts sharply
with DTDs, whose non-XML syntax requires dedicated parsing logic separate from the
handling of ordinary XML markup [27].

Furthermore, XSD's introduction of robust namespace support directly addressed a
major DTD limitation, particularly relevant in the context of the web and large-scale data
integration [22]. As XML usage grew, the need to combine vocabularies from different
sources (e.g., mixing XHTML with SVG, or combining industry-specific schemas)
became critical. DTDs, lacking namespace awareness, could easily lead to naming
collisions where identically named elements from different vocabularies were
indistinguishable [23]. XSD's built-in namespace handling provides a mechanism to
uniquely identify elements and attributes based on their associated namespace URI,
preventing ambiguity and enabling the creation of modular, reusable, and interoperable
schemas essential for complex systems.
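
The naming-collision problem and its namespace-based resolution can be sketched in a few lines of Python (standard library only). The two vocabularies and their URIs below are hypothetical; both define an element called title.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabularies: each one defines its own <title> element.
DOC = """<doc xmlns:bk="http://example.com/books"
              xmlns:job="http://example.com/jobs">
  <bk:title>XML Basics</bk:title>
  <job:title>Lecturer</job:title>
</doc>"""

root = ET.fromstring(DOC)

# ElementTree expands each prefixed name to {namespace-URI}local-name,
# so the two <title> elements have different full names.
tags = [child.tag for child in root]
print(tags)

# Prefixes are only local shorthand; matching is done on the URI.
ns = {"b": "http://example.com/books", "j": "http://example.com/jobs"}
book_title = root.find("b:title", ns).text
job_title = root.find("j:title", ns).text
print(book_title, "|", job_title)
```

Note that the prefixes used in the lookup (b, j) differ from those in the document (bk, job); only the namespace URIs have to agree.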

However, the increased power and expressiveness of XSD came at the cost of
complexity [23]. The XSD specification is considerably larger and more intricate than the
DTD specification. This complexity led some developers to seek simpler or different
approaches to schema definition, contributing to the development and adoption of
alternative schema languages like RELAX NG and Schematron [24]. These alternatives often
focus on different validation aspects or offer simpler syntax, suggesting that while XSD
became the dominant standard, the quest for the optimal balance between expressive
power and usability in schema languages continued. This reflects a common pattern in
technology evolution where powerful, comprehensive solutions may sometimes be
perceived as overly complex for specific needs, leading to the emergence of more
focused or streamlined alternatives.

2.6 Practical Idea: Define DTD/XSD for Contacts XML

Task: Using the [Link] file created in Module 1:

1. Internal DTD: Modify [Link] to include an internal DTD (within the
   <!DOCTYPE [...]> section) that validates its current structure (elements: contacts,
   contact, name, email, phone; attributes: id on contact, type on phone). Ensure all
   elements and attributes used in the XML are declared.
2. External XSD: Create a separate file named [Link]. Write an XML Schema
   definition that:
   o Defines the structure (contacts containing contact elements, etc.).
   o Specifies basic data types (e.g., xs:string for name, xs:string for id).
   o Uses the EmailType simple type with the pattern restriction from the example in
     section 2.3.
   o Uses the PhoneCategoryType simple type with the enumeration restriction from
     the example in section 2.3 for the type attribute of the phone element.
   o Sets the targetNamespace to [Link]
3. Link XSD: Modify the [Link] file (remove the internal DTD if added in step
   1) to link to the external [Link] using the xsi:schemaLocation attribute.
   Ensure the root element declares the target namespace
   (xmlns="[Link]") and the xsi namespace.
4. Validation (Conceptual): Use an online XML validator or an XML editor with
   validation capabilities (like Visual Studio [34], XMLSpy, Oxygen XML Editor) to
   validate [Link] first against the DTD (if created) and then against the XSD.
   Introduce deliberate errors (e.g., incorrect phone type, malformed email, missing
   required attribute) and observe the validation failures reported by the tool.

Module 3: Presentation of XML Documents


While XML's primary role is data description [1], mechanisms exist to control its
presentation, particularly for display in web browsers or transformation into other
formats. The two main technologies for this are Cascading Style Sheets (CSS) and
Extensible Stylesheet Language Transformations (XSLT).

3.1 Styling XML with CSS

Cascading Style Sheets (CSS) is a stylesheet language used to describe the
presentation of documents written in markup languages like HTML or XML [38]. It allows
separation of presentation rules from the document structure [40].

• Concept: CSS rules can be applied to XML elements to control their visual
rendering (fonts, colors, spacing, layout) when displayed in a compatible browser [3].
The browser parses the XML and applies the associated CSS rules.

• Linking CSS to XML: The standard mechanism is the xml-stylesheet processing
instruction, placed in the XML document's prolog (after the XML declaration) [3].
  o Syntax: <?xml-stylesheet type="text/css" href="path/to/[Link]"?>
  o type="text/css": Specifies the stylesheet language is CSS.
  o href: Provides the URI of the external CSS file.

• CSS Selectors for XML: Standard CSS selectors work on XML documents [38].
  o Type Selectors: Target elements by their tag name (e.g., book, title, author).
  Remember XML is case-sensitive.
  o Attribute Selectors: Target elements based on their attributes (e.g.,
  book[category="fiction"], phone[type="mobile"]).
  o ID and Class Selectors: Can be used if the XML attributes id and class are
  present (e.g., #summary, .important).
  o Descendant, Child, Sibling Selectors: The descendant (whitespace), child (>),
  adjacent sibling (+), and general sibling (~) combinators work as they do in HTML.

• Display Properties: A crucial difference from HTML is that XML elements
generally have no default display semantics in the browser. Therefore, CSS rules
for XML often need to explicitly set the display property (e.g., display: block;,
display: inline;, display: table-cell;) to control layout [42].

• Example:
  o XML ([Link]):

XML

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="report_style.css"?>
<report>
  <title>Annual Sales Report</title>
  <section id="summary">
    <heading>Executive Summary</heading>
    <paragraph>Overall sales demonstrated a significant increase of
    <emphasis>15%</emphasis> compared to the previous fiscal year.</paragraph>
  </section>
  <section id="details">
    <heading>Detailed Analysis</heading>
    <paragraph>Region North experienced exceptional growth, exceeding targets
    by 25%.</paragraph>
    <paragraph>Region South showed steady performance but requires strategic
    focus for improvement.</paragraph>
    <item_list>
      <item>Product A: +20%</item>
      <item>Product B: +5%</item>
    </item_list>
  </section>
</report>

  o CSS (report_style.css):

CSS

/* Basic document styling */
report {
  display: block; /* Treat root like a block */
  font-family: Verdana, Geneva, sans-serif;
  margin: 2em;
  border: 1px solid #aaa;
  padding: 1em;
  background-color: #f8f8f8;
}

/* Title styling */
title {
  display: block; /* Make title a block */
  font-size: 2em;
  font-weight: bold;
  color: #2a2a7e;
  margin-bottom: 1em;
  text-align: center;
  border-bottom: 2px solid #2a2a7e;
  padding-bottom: 0.5em;
}

/* Section styling */
section {
  display: block; /* Make sections blocks */
  margin-bottom: 1.5em;
  padding: 1em;
  border: 1px dashed #ccc;
}

/* Specific section styling */
section#summary {
  background-color: #eef; /* Light blue background for summary */
}

/* Heading styling */
heading {
  display: block; /* Make headings blocks */
  font-size: 1.4em;
  font-weight: bold;
  color: #333;
  margin-bottom: 0.7em;
}

/* Paragraph styling */
paragraph {
  display: block; /* Make paragraphs blocks */
  margin-bottom: 0.5em;
  line-height: 1.5;
}

/* Emphasis styling */
emphasis {
  font-style: italic;
  color: #006400; /* Dark green */
  font-weight: bold;
}

/* Item list styling */
item_list {
  display: block;
  margin-top: 1em;
  margin-left: 2em;
}

item {
  display: list-item; /* Render as list items */
  list-style-type: square;
  margin-bottom: 0.3em;
}

(Based on concepts from [3])

• Browser Rendering: When [Link] is opened in a browser that supports
XML+CSS rendering, it will display the content formatted according to the rules in
report_style.css. Note that browser support and consistency for styling arbitrary
XML with CSS can sometimes be less robust than for HTML. Loading local XML
files with local CSS might also trigger security restrictions in some browsers;
serving them from a web server is often necessary. 42

3.2 Transforming XML with XSLT

Extensible Stylesheet Language Transformations (XSLT) provides a more powerful
mechanism than CSS for presenting XML data. It is an XML-based language
specifically designed to transform the structure and content of an input XML document
into another document, which could be HTML, a different XML vocabulary, plain text, or
other formats. 3

• Concept: An XSLT processor takes two inputs: the source XML document and an
XSLT stylesheet (itself an XML document). It processes the source tree based on
rules (templates) defined in the stylesheet and generates a result tree. 18 The original
source XML document remains unchanged. 18

• Capabilities: XSLT goes beyond simple styling. It can:


o Rearrange and reorder elements from the source document. 18

o Add or remove elements and attributes.


o Filter content based on conditions.
o Sort data. 18

o Perform calculations and manipulate text content.

o Combine data from multiple XML sources (using the document() function ). 50

o Generate completely different output formats (e.g., transforming XML data into
an HTML webpage). 3

• Stylesheet Structure:
o XML Document: An XSLT stylesheet must be a well-formed XML document. 44

o Root Element: The root element must be <xsl:stylesheet> or <xsl:transform>. 44

o Namespace: It must declare the XSLT namespace:
xmlns:xsl="http://www.w3.org/1999/XSL/Transform". 44 The xsl prefix is
conventional but can be changed. 44

o Version: The version attribute is mandatory (e.g., version="1.0"). 44

o Templates (<xsl:template>): Contain rules for processing nodes matched by an
XPath expression in the match attribute. 18

• Linking XSLT to XML: Similar to CSS, the xml-stylesheet processing instruction is
used in the source XML document's prolog. 3

o Syntax: <?xml-stylesheet type="text/xsl" href="path/to/[Link]"?>
o type="text/xsl" indicates an XSLT stylesheet (though text/xml is sometimes seen). 44

3.3 Core XSLT Elements and Basic XPath

XSLT relies heavily on XPath to select nodes from the source XML tree for processing. 44

• XPath Basics (Recap): Path expressions navigate the source tree.


o /: Root node.
o elementName: Selects child elements with that name.
o //: Selects nodes anywhere in the document (descendant-or-self).
o @attributeName: Selects an attribute.
o .: The current context node.
o ..: The parent of the context node.
o *: Wildcard for any element.
o [...]: Predicates for filtering based on conditions (e.g., [@id='c001'], [position()=1],
[price > 10]). (See Module 7 for a full XPath deep dive; concepts from ) 52

• Key XSLT Elements: (Referenced elements listed in ) 3

o <xsl:template match="xpath-pattern">: Defines a processing rule for nodes
matching the pattern. The pattern / matches the root of the entire document. 42

o <xsl:apply-templates select="xpath-expression">: Processes the nodes selected
by the select expression, finding the best matching template for each selected
node. If select is omitted, it processes all child nodes of the current node.
This enables the recursive, template-driven nature of XSLT. 42

o <xsl:value-of select="xpath-expression">: Extracts the string value of the first
node selected by the expression and outputs it as text in the result tree. 42

o <xsl:for-each select="xpath-expression">: Iterates over each node in the
sequence selected by the expression, executing the content of the for-each
element for every node in the sequence. 47 The context node changes to the
current node in the iteration.



o <xsl:if test="boolean-xpath-expression">: Executes its content only if the test
expression evaluates to true.
o <xsl:choose>, <xsl:when test="...">, <xsl:otherwise>: Implements multi-
branch conditional logic, similar to if-elseif-else or a switch statement.
o <xsl:attribute name="attr-name">: Creates an attribute node in the result tree.
The content of this element becomes the attribute's value. Often used inside
literal result elements or <xsl:element>.
o <xsl:element name="elem-name">: Creates an element node in the result tree
with the specified name. The content of this element becomes the element's
content. Useful for creating elements with computed names.
o <xsl:copy-of select="xpath-expression">: Performs a deep copy of the
selected nodes (including their attributes and descendants) to the result tree.
o <xsl:copy>: Performs a shallow copy of the current node (the node itself, without
attributes or children). Frequently used with <xsl:apply-templates
select="@*|node()"/> to copy the current node and recursively process its
attributes and children, forming the basis of an identity transform. 42

o <xsl:output method="xml|html|text" indent="yes|no">: Specifies the desired
format of the output document and whether the processor should attempt to add
indentation (pretty-print). 42

o Literal Result Elements (LREs): Any element in the stylesheet that is not in the
XSLT namespace (i.e., doesn't have the xsl: prefix) is treated as a literal result
element. It is copied directly to the output tree. Attributes on LREs are also
copied, and attribute value templates (using {xpath-expression} within attribute
values) allow dynamic computation of attribute values. Example:
<p class="{$category}">...</p>. 18
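
The XPath constructs recapped in this section can be tried outside XSLT as well. The following is a minimal sketch using Python's built-in xml.etree.ElementTree, which supports only a subset of XPath 1.0 (child steps, ".//", "..", attribute predicates and positional predicates). The sample document is a hypothetical, shortened copy of the contacts data, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data modelled on the contacts example in these notes.
XML = """<contacts>
  <contact id="c001">
    <name>Alice Wonderland</name>
    <phone type="mobile">123-456-7890</phone>
  </contact>
  <contact id="c002">
    <name>Bob The Builder</name>
    <phone type="work">987-654-3210</phone>
    <phone type="mobile">555-555-5555</phone>
  </contact>
</contacts>"""

root = ET.fromstring(XML)  # root is the <contacts> element

# elementName: child elements of the context node
names = [c.findtext("name") for c in root.findall("contact")]

# //: descendants at any depth (ElementTree spells this ".//")
all_phones = root.findall(".//phone")

# [@attr='value']: predicate on an attribute value
alice = root.find("contact[@id='c001']")

# [n]: 1-based positional predicate
first_contact = root.find("contact[1]")

# ..: parent of the matched node (contacts that have a work phone)
work_contacts = root.findall(".//phone[@type='work']/..")

# @attributeName: ElementTree reads attribute values with get(),
# it has no "@name" selection step of its own
mobile_numbers = [p.text for p in all_phones if p.get("type") == "mobile"]

print(names)
print(len(all_phones), alice.get("id"), first_contact.get("id"))
print([c.get("id") for c in work_contacts], mobile_numbers)
```

A full XPath 1.0 engine (for example the one used by an XSLT processor) accepts more than this subset, such as comparisons like [price > 10] and functions like position().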

3.4 Example: Transforming XML to HTML

Let's transform the [Link] file into an HTML table.

• XML ([Link] - from Module 1):


XML
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="contacts_to_html.xsl"?>
<contacts>
<contact id="c001">

<name>Alice Wonderland</name>
<email>alice@[Link]</email>

<phone type="mobile">123-456-7890</phone>

</contact>

<contact id="c002">

<name>Bob The Builder</name>


<email>bob@[Link]</email>

<phone type="work">987-654-3210</phone>

<phone type="mobile">555-555-5555</phone>

</contact>
</contacts>

• XSLT (contacts_to_html.xsl):
XML
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="html" indent="yes"/>

<xsl:template match="/">

<html>

<head>

<title>Contact List</title>
<style>
table, th, td { border: 1px solid black; border-collapse: collapse; padding: 5px; }
th { background-color: #f2f2f2; }
ul { list-style-type: none; margin: 0; padding: 0; }
</style>
</head>

<body>

<h1>Contacts</h1>

<table>

<thead>

<tr>

<th>ID</th>

<th>Name</th>

<th>Email</th>

<th>Phone Numbers</th>
</tr>

</thead>

<tbody>

<xsl:apply-templates select="/contacts/contact"/>

</tbody>

</table>

</body>

</html>

</xsl:template>

<xsl:template match="contact">

<tr>

<td><xsl:value-of select="@id"/></td>

<td><xsl:value-of select="name"/></td>

<td><xsl:value-of select="email"/></td>

<td>

<ul>

<xsl:apply-templates select="phone"/>

</ul>

</td>

</tr>

</xsl:template>

<xsl:template match="phone">

<li>

<xsl:value-of select="."/>

(<xsl:value-of select="@type"/>)
</li>

</xsl:template>

</xsl:stylesheet>

(Based on concepts from ) 18

• Resulting HTML (Conceptual Output):


HTML
<html>
<head>
<title>Contact List</title>
<style>
table, th, td { border: 1px solid black; border-collapse: collapse; padding: 5px; }
th { background-color: #f2f2f2; }
ul { list-style-type: none; margin: 0; padding: 0; }
</style>
</head>
<body>

<h1>Contacts</h1>

<table>

<thead>

<tr>

<th>ID</th>

<th>Name</th>

<th>Email</th>

<th>Phone Numbers</th>
</tr>

</thead>

<tbody>

<tr>

<td>c001</td>

<td>Alice Wonderland</td>
<td>alice@[Link]</td>

<td>

<ul>

<li>123-456-7890 (mobile)</li>
</ul>

</td>

</tr>

<tr>

<td>c002</td>

<td>Bob The Builder</td>


<td>bob@[Link]</td>

<td>

<ul>

<li>987-654-3210 (work)</li>
<li>555-555-5555 (mobile)</li>
</ul>

</td>

</tr>

</tbody>

</table>
</body>
</html>

This HTML, generated by the XSLT processor, can be directly rendered by any web
browser.
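
To make the template logic of this example easier to follow, it can be mirrored procedurally. The sketch below is not XSLT and does not run an XSLT processor; it is a hand-written Python approximation (standard library only) in which each hypothetical function plays the role of one template from contacts_to_html.xsl. The sample data and the example.com addresses are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample data modelled on the contacts example.
SOURCE = """<contacts>
  <contact id="c001">
    <name>Alice Wonderland</name>
    <email>alice@example.com</email>
    <phone type="mobile">123-456-7890</phone>
  </contact>
  <contact id="c002">
    <name>Bob The Builder</name>
    <email>bob@example.com</email>
    <phone type="work">987-654-3210</phone>
    <phone type="mobile">555-555-5555</phone>
  </contact>
</contacts>"""


def phone_item(phone):
    # Plays the role of <xsl:template match="phone">
    return f"<li>{escape(phone.text or '')} ({escape(phone.get('type', ''))})</li>"


def contact_row(contact):
    # Plays the role of <xsl:template match="contact">
    phones = "".join(phone_item(p) for p in contact.findall("phone"))
    return (
        "<tr>"
        f"<td>{escape(contact.get('id', ''))}</td>"
        f"<td>{escape(contact.findtext('name', default=''))}</td>"
        f"<td>{escape(contact.findtext('email', default=''))}</td>"
        f"<td><ul>{phones}</ul></td>"
        "</tr>"
    )


def contacts_table(root):
    # Plays the role of the table body built in <xsl:template match="/">
    rows = "".join(contact_row(c) for c in root.findall("contact"))
    return f"<table><tbody>{rows}</tbody></table>"


html_table = contacts_table(ET.fromstring(SOURCE))
print(html_table)
```

The difference in style is the point of the comparison: in XSLT the processor chooses which template applies to each node, whereas here the calls are wired together by hand.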

3.5 Module Insights & Connections

The contrast between CSS and XSLT for XML presentation highlights a fundamental
difference in approach. CSS acts as a non-invasive layer, applying display rules to the
existing XML structure primarily for browser rendering. It modifies how the data looks. 3

XSLT, conversely, is a transformation language; it actively restructures the source XML,
potentially filtering, sorting, and calculating new values, to generate an entirely new
output document, often in HTML format. 3 It modifies what the output structure is and
what content it contains. This transformative power makes XSLT far more versatile for
complex presentation scenarios where the desired output structure differs significantly
from the source XML structure, or where data manipulation is required before display.
CSS is simpler for direct styling when the XML structure closely matches the desired
display layout.

The integral role of XPath within XSLT underscores XPath's significance as more than
just a standalone query language. 44 It serves as the essential mechanism for selecting
nodes within the source document that XSLT templates will process (match attribute)
and for extracting data or selecting further nodes for processing within those templates
(select attribute in elements like <xsl:value-of>, <xsl:apply-templates>, <xsl:for-each>).
This deep integration means that proficiency in XPath is a prerequisite for effective
XSLT development. Furthermore, XPath's utility extends beyond XSLT, being crucial for
programmatic DOM navigation and forming the basis for XQuery, positioning it as a
core technology across various XML processing domains. 51 55 Understanding XPath
provides leverage in multiple areas of XML technology.

3.6 Practical Idea: Create CSS/XSLT for Contacts XML

Task: Using the [Link] file from previous modules:

1. CSS Styling:
o Create a CSS file ([Link]).
o Write CSS rules to style the contacts, contact, name, email, and phone elements
for clear display in a browser. For example:
▪ Make each contact a block with a border and margin.
▪ Style the name element prominently (e.g., larger font, bold).
▪ Display email and phone elements clearly, perhaps using pseudo-elements
(::before) to add labels like "Email: " or "Phone: ".
▪ Style phone numbers differently based on their type attribute using attribute
selectors (e.g., phone[type="mobile"] { color: green; }).
o Link this CSS file to [Link] using <?xml-stylesheet?>.
o Open [Link] in a browser (using a local web server if needed) to view the
styled output.
2. XSLT Transformation:
o Create an XSLT file (contacts_to_html.xsl) similar to the example in section 3.4.
o Ensure it transforms the [Link] data into a well-structured HTML page
(e.g., using an HTML table or a definition list <dl>).
o Include columns/sections for ID, Name, Email, and Phone(s) with type.
o Link this XSLT file to [Link] using <?xml-stylesheet?> (replacing the CSS
link).
o Open [Link] in a browser to view the generated HTML output.
3. Comparison: Compare the visual results and the development process. Note
the differences in control over structure and content between the CSS and XSLT
approaches. Discuss which approach might be better suited depending on whether
the goal is simple styling versus significant restructuring for presentation.

Module 4: Related XML Standards

Beyond the core XML specification, DTDs, XSDs, CSS, and XSLT, several other related
standards and technologies are important within the broader XML ecosystem. This
module focuses on XHTML and the Document Object Model (DOM).

4.1 XHTML (Extensible HyperText Markup Language)

XHTML emerged as an effort by the W3C to bridge the gap between the widespread
use of HTML and the stricter syntax requirements of XML. 19

• Definition: XHTML is essentially a reformulation of HTML (specifically HTML 4.01)
that adheres to the syntax rules of XML. 19 The goal was to create a version of HTML
that was well-formed XML, making it more suitable for automated processing and
integration with other XML technologies. 58

• Relationship to HTML & XML: It uses the familiar vocabulary (tags and attributes)
of HTML but enforces XML's stricter syntax rules. 19 It was intended to be the
successor to HTML 4. 59 However, the development trajectory shifted; the WHATWG
group continued developing HTML, leading to HTML5, which became the dominant
standard. 59 HTML5 itself is designed to be backward-compatible and handle much
of the markup previously considered XHTML, although its parsing rules are more
lenient than XML's. 60 The term "XHTML" is now less frequently used to describe the
XML serialization of HTML. 61

• Key Syntax Differences from HTML (Recap): As an XML application, XHTML
requires: 19

o Well-formedness: Must adhere to all XML syntax rules.


o Proper Nesting: No overlapping tags.
o Mandatory Closing Tags: All elements must have an explicit closing tag (e.g.,
<p>...</p>) or be self-closing (e.g., <br />, <img... />).
o Quoted Attributes: All attribute values must be enclosed in single or double
quotes.
o No Attribute Minimization: Attributes must have explicit values (e.g.,
checked="checked" instead of just checked).
o Lowercase: Element and attribute names must be in lowercase.

• MIME Type and Parsing: The critical factor determining how a document is
processed is the MIME type sent by the server. 19

o text/html: The document is parsed using an HTML parser, which is typically
lenient and performs error correction. 19 Most documents written with XHTML
syntax on the web are served this way, effectively making them invalid HTML
processed with HTML rules. 19

o application/xhtml+xml: The document is parsed using an XML parser, which
enforces strict well-formedness rules. 19 A single syntax error will typically
prevent rendering. This MIME type caused issues with older browsers (notably
Internet Explorer) that did not support it properly. 19

• XML Namespace: XHTML documents parsed as XML require the XHTML
namespace declaration, typically on the root <html> element:
xmlns="http://www.w3.org/1999/xhtml". 41

While the push for XHTML as the primary web standard waned with the rise of HTML5,
understanding its principles remains valuable as it represents the application of XML's
rigor to web documents and influenced the design of HTML5.
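
The practical effect of the syntax rules above can be checked with any XML parser. The sketch below is an illustration only: it uses Python's built-in xml.etree.ElementTree, and the helper name is_well_formed and both markup strings are hypothetical. It tests well-formedness, not validity against an XHTML DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup without a syntax error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


XHTML_NS = "http://www.w3.org/1999/xhtml"

# HTML-style markup: unclosed <br>, unquoted and minimized attributes.
html_style = (
    "<html><body><p>Line one<br>"
    "<input type=checkbox checked></p></body></html>"
)

# The same content written to the XHTML rules listed above.
xhtml_style = (
    f'<html xmlns="{XHTML_NS}"><body><p>Line one<br />'
    '<input type="checkbox" checked="checked" /></p></body></html>'
)

print(is_well_formed(html_style))
print(is_well_formed(xhtml_style))
```

An HTML parser would accept both strings and silently repair the first one; an XML parser rejects it outright, which mirrors the text/html versus application/xhtml+xml behaviour described above.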

4.2 Document Object Model (DOM)

The Document Object Model (DOM) is a crucial standard for interacting with structured
documents like HTML and XML programmatically.

• Concept: The DOM is a cross-platform, language-neutral programming interface
(API) that represents the structure, style, and content of an HTML or XML
document as a logical tree of objects in memory. 54 Each part of the document
(elements, attributes, text, comments, etc.) corresponds to a node in this tree. 64

• Purpose: It provides a standard way for programs and scripts (most commonly
JavaScript in web browsers) to dynamically access, traverse, and manipulate the
content, structure, and style of documents. 54 Essentially, the DOM connects the
static document markup to dynamic scripting languages. 66

• Tree Structure: The DOM organizes the document hierarchically: 54

o Nodes: The fundamental units (Element, Attr, Text, Comment, Document, etc.). 64

o Relationships: Nodes have relationships like parent, child, sibling, ancestor,
descendant. 54

o Root Node: The top-level node representing the entire document (e.g., the
Document node itself, or the <html> element in HTML). 54

• Language-Neutral Standard: Although most web developers interact with the
DOM via JavaScript, the DOM specification itself is independent of any
programming language. 63 Parsers and DOM implementations exist for many
languages, including Python, Java, Perl, and others. 63 36 W3C and WHATWG
manage the DOM standards. 64

• Core Interfaces (Conceptual Overview): The DOM defines standard interfaces
(like classes or object types) that represent different parts of the document and
provide methods and properties for interaction. Key interfaces include:
o Document: Represents the entire XML or HTML document. Acts as the entry
point for accessing nodes. Provides methods like createElement(),
createTextNode(), getElementById(), getElementsByTagName(),
querySelector(), querySelectorAll(). 54

o Node: The base interface from which most other DOM interfaces inherit.
Provides fundamental properties like nodeName, nodeValue, nodeType,
parentNode, childNodes, firstChild, lastChild, textContent, and methods like
appendChild(), removeChild(), insertBefore(), cloneNode(). 54

o Element: Represents an element node (e.g., <book>, <p>). Inherits from Node.
Provides element-specific methods like getAttribute(), setAttribute(),
removeAttribute(), getElementsByTagName() (scoped to the element). 54

o Attr: Represents an attribute node. Accessed typically via Element methods
rather than direct tree traversal. 63

o Text: Represents the textual content within an element or attribute. 64 Inherits from
CharacterData, which inherits from Node.

o NodeList, HTMLCollection, NamedNodeMap: These are collection interfaces
representing ordered or unordered lists of nodes (e.g., returned by childNodes
or getElementsByTagName). 63 They provide methods like item() and properties
like length.
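
Because the DOM is language-neutral, the interfaces listed above look much the same outside the browser. As a rough illustration, the sketch below uses Python's built-in xml.dom.minidom, a lightweight DOM implementation; the sample document and the added "note" element are hypothetical and exist only to exercise the Document, Element, Text and Node members named above.

```python
from xml.dom import minidom

# Hypothetical sample data modelled on the contacts example.
XML = (
    '<contacts><contact id="c001">'
    "<name>Alice Wonderland</name>"
    "</contact></contacts>"
)

doc = minidom.parseString(XML)                      # Document: entry point
contact = doc.getElementsByTagName("contact")[0]    # Element

# Element.getAttribute() and the data of a Text node
contact_id = contact.getAttribute("id")
name_text = contact.getElementsByTagName("name")[0].firstChild.data

# Element.setAttribute()
contact.setAttribute("status", "processed")

# Document.createElement() / createTextNode() and Node.appendChild()
note = doc.createElement("note")
note.appendChild(doc.createTextNode("checked"))
contact.appendChild(note)

# Node.childNodes, nodeName and nodeType
child_names = [n.nodeName for n in contact.childNodes]
is_element = contact.nodeType == contact.ELEMENT_NODE

serialized = doc.documentElement.toxml()
print(contact_id, name_text, child_names, is_element)
print(serialized)
```

The method and property names match those used from JavaScript, which is the practical meaning of "language-neutral" here; browser-only conveniences such as querySelector() are not part of minidom.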

4.3 DOM Access and Manipulation Example (JavaScript)

The following example demonstrates basic DOM manipulation using JavaScript within
an HTML page. The same principles apply when working with an XML document parsed
into a DOM structure using tools like DOMParser.

HTML (dom_example.html):

HTML

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">

<title>DOM Manipulation Example</title>


<style>.highlight { background-color: yellow; font-weight: bold; } </style>
</head>
<body>
<h1>Working with the DOM</h1>
<p id="intro">This is the introduction paragraph.</p>
<div id="content">

<p>First item.</p>
<p>Second item.</p>
</div>

<button id="changeBtn">Change Content</button>


<button id="addBtn">Add Item</button>

<script>
// --- DOM Manipulation Script ---

// 1. Selecting Elements
const introPara = document.getElementById('intro');
const contentDiv = document.querySelector('#content'); // Using CSS selector
const allParas = document.querySelectorAll('p');
const changeButton = document.getElementById('changeBtn');
const addButton = document.getElementById('addBtn');

// Function to run on button click
function modifyContent() {
  // 2. Accessing/Modifying Text Content
  introPara.textContent = "Introduction paragraph has been updated!";

  // 3. Accessing/Modifying Attributes
  introPara.setAttribute('data-status', 'updated');
  console.log("Intro para status:", introPara.getAttribute('data-status'));

  // 4. Changing Styles
  if (allParas.length > 1) {
    allParas[1].style.color = 'blue';
    allParas[1].style.fontWeight = 'bold';
  }

  // Add a CSS class
  contentDiv.classList.add('highlight');
}

// Function to add a new item
function addItem() {
  // 5. Creating New Elements
  const newPara = document.createElement('p');
  newPara.textContent = 'Newly added item.';

  // 6. Adding Elements
  contentDiv.appendChild(newPara);

  // Example: Removing the first paragraph inside the div (if it exists)
  const firstItem = contentDiv.querySelector('p');
  if (firstItem && contentDiv.children.length > 1) {
    // contentDiv.removeChild(firstItem); // Uncomment to enable removal
    // console.log("Removed first item.");
  }
}

// Attach event listeners to buttons
changeButton.addEventListener('click', modifyContent);
addButton.addEventListener('click', addItem);

</script>
</body>
</html>

Conceptual XML DOM Manipulation (JavaScript):

If we had parsed [Link] into an xmlDoc object using DOMParser, we could
apply similar logic: 63

JavaScript

// Assume xmlDoc holds the parsed DOM of [Link]

// Get all contact elements
let contacts = xmlDoc.getElementsByTagName('contact');
console.log(`Found ${contacts.length} contacts.`);

for (let contact of contacts) {
  // Get data using DOM methods
  let id = contact.getAttribute('id');
  let nameElement = contact.querySelector('name'); // Get first <name> child
  let name = nameElement ? nameElement.textContent : 'N/A'; // Access text content safely
  let emailElement = contact.querySelector('email');
  let email = emailElement ? emailElement.textContent : 'N/A';

  console.log(`Processing Contact ID: ${id}, Name: ${name}, Email: ${email}`);

  // Modify an attribute
  contact.setAttribute('status', 'processed');
  console.log(` Set status to: ${contact.getAttribute('status')}`);

  // Add a new 'last_updated' element
  let updatedElement = xmlDoc.createElement('last_updated'); // Create new element
  updatedElement.textContent = new Date().toISOString(); // Set its text content
  contact.appendChild(updatedElement); // Append it to the current contact
}

// Note: To save these changes, the modified xmlDoc would need to be
// serialized back into an XML string using XMLSerializer.

// Example of serialization (conceptual)
// const serializer = new XMLSerializer();
// const updatedXmlString = serializer.serializeToString(xmlDoc);
// console.log("\nUpdated XML:\n", updatedXmlString);

(Based on concepts from ) 54

This illustrates how the standard DOM API provides a consistent way to interact with the
structure of both HTML and XML documents once they are loaded into memory.

4.4 Module Insights & Connections

The Document Object Model provides a critical abstraction layer, transforming a static
text-based markup document (HTML or XML) into a dynamic, in-memory object
representation. 63 This object-oriented perspective is fundamental because it exposes the
document's components (elements, attributes, text) as objects with properties and
methods. This allows scripting languages like JavaScript to interact with the document
not as raw text, but as a structured collection of manipulable objects. 54 Without this
standardized object model, dynamically modifying web page content or
programmatically processing XML data in a structured way would be significantly more
complex and non-standardized.

The DOM standard encompasses both generic interfaces applicable to any XML
document (like Node, Element, Document) and more specialized interfaces for HTML
(like HTMLElement, HTMLInputElement). 63 This layered design reflects the dual role of
browsers in handling both generic XML and the specific semantics of HTML. It allows
developers to use common, fundamental DOM manipulation techniques across both
types of documents (leveraging the generic interfaces), while still providing access to
HTML-specific properties and methods when needed. 64 This promotes consistency and
code reuse, as the basic principles of node traversal, creation, and modification apply
broadly, regardless of whether the underlying document is a generic XML data file or a
rich HTML web page.

4.5 Practical Idea: JavaScript DOM Manipulation of XML

Task: Create a simple web page (xml_loader.html) that demonstrates loading and
displaying data from an XML file using JavaScript and the DOM.

1. HTML Structure: Create an HTML file with:


o A button (e.g., <button id="loadBtn">Load Contacts</button>).
o An empty div or ul element to display the results (e.g., <ul id="contactList"></ul>).
2. XML File: Use the [Link] file created previously. Ensure it's accessible
(e.g., in the same directory or served by a local web server).
3. JavaScript Logic: Add a <script> block (or link an external JS file) that:
o Adds an event listener to the button.
o When the button is clicked:
▪ Uses the fetch API to asynchronously retrieve [Link].
▪ Gets the response text using response.text().
▪ Creates a DOMParser instance. 63
▪ Parses the XML text into an XML Document object using
parser.parseFromString(xmlText, "text/xml"). 63
▪ Selects the <ul> element where results will be displayed. Clears any
previous content from it.
▪ Uses DOM methods (getElementsByTagName, getAttribute, textContent) to
iterate through the <contact> elements in the parsed xmlDoc. 63
▪ For each contact, extracts the id, name, and email.
▪ Creates a new HTML <li> element (document.createElement('li')).
▪ Sets the textContent of the <li> to display the extracted contact information
(e.g., ID: c001, Name: Alice Wonderland, Email: alice@[Link]).
▪ Appends the newly created <li> element to the results <ul> using
appendChild.
o Include basic error handling (e.g., for fetch errors or parsing errors).

Module 5: Programming with XML


Processing XML data programmatically involves parsing (reading and interpreting) XML
documents and often generating (creating) new XML documents. Various strategies and
libraries exist in popular programming languages like Python and Java to facilitate these
tasks.

5.1 XML Parsing Strategies: DOM vs. SAX

Two primary models dominate XML parsing: the Document Object Model (DOM) and
the Simple API for XML (SAX).

• DOM Parsing:
o Mechanism: The parser reads the entire XML document and constructs a
complete, hierarchical tree representation of the document in memory. 36 Each
element, attribute, and text segment becomes a node object in this tree.
o Advantages:
▪ Random Access: Allows navigation throughout the entire document tree in
any direction (parent, child, sibling) and modification of any node after
parsing is complete. 36

▪ Simpler Logic (for some tasks): Easier to implement logic that requires
knowledge of the document's overall structure or involves complex
structural manipulations.
o Disadvantages:

▪ Memory Intensive: Requires loading the entire document structure into
memory, making it unsuitable for very large XML files that exceed available
RAM. 69

▪ Initial Latency: Can have a higher startup time as the entire document
must be read and parsed before processing can begin. 71

• SAX Parsing:
o Mechanism: SAX is an event-driven, sequential-access parsing API. 36 The parser
reads the XML document sequentially from beginning to end. As it encounters
specific structures (start of element, end of element, character data, start/end of
document), it triggers corresponding callback methods (events) in a handler
object provided by the application. 70 It does not build an in-memory tree. 70


o Advantages:
▪ Memory Efficient: Processes the XML document in a stream, requiring
very little memory regardless of the document size. Ideal for huge files. 33

▪ Fast Startup: Processing can begin as soon as the parser starts reading
the document.
o Disadvantages:
▪ Forward-Only: Cannot navigate backward or randomly access parts of the
document that have already been processed. 70

▪ State Management: The application's handler needs to maintain its own
state to understand the context of the events (e.g., tracking which element
is currently open). This can make the programming logic more complex. 73

▪ Modification Difficult: Not suitable for tasks requiring modification of the
XML structure during parsing.
• Other Approaches:
o StAX (Streaming API for XML): Primarily used in Java, StAX is another
streaming API, but it uses a pull model instead of SAX's push model. The
application requests (pulls) the next event from the parser, giving the
application more control over the parsing process. 75

o Iterative Parsing: Libraries like Python's ElementTree and lxml offer iterparse
functions. 8 This approach combines aspects of SAX and DOM. It parses the
document incrementally like SAX but can yield partial element trees as they are
completed. Crucially, it allows the application to process an element and then
discard it (elem.clear()) to free up memory, providing a balance between ease
of use and memory efficiency for large files. 35

The choice between DOM and SAX (or related streaming/iterative methods) hinges on
the specific requirements: DOM is often preferred for smaller documents or when
random access and modification are needed, while SAX or iterative parsing is
necessary for large documents where memory consumption is a primary concern. 70 This
reflects a classic space-time trade-off: DOM prioritizes ease of access (time/complexity)
at the cost of memory (space), while SAX prioritizes memory efficiency at the cost of
access flexibility and potentially more complex application logic.
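
The difference in programming style between the two models can be sketched with Python's standard library. In the example below, the handler class NameCollector and the sample document are hypothetical; xml.sax stands in for the SAX model and xml.dom.minidom for the DOM model. Both halves extract the same list of names.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample data modelled on the contacts example.
XML = b"""<contacts>
  <contact id="c001"><name>Alice Wonderland</name></contact>
  <contact id="c002"><name>Bob The Builder</name></contact>
</contacts>"""


class NameCollector(xml.sax.ContentHandler):
    """SAX handler: must track its own state between callbacks."""

    def __init__(self):
        super().__init__()
        self.in_name = False   # are we currently inside a <name> element?
        self.buffer = []       # text may arrive in several chunks
        self.names = []

    def startElement(self, tag, attrs):
        if tag == "name":
            self.in_name = True
            self.buffer = []

    def characters(self, content):
        if self.in_name:
            self.buffer.append(content)

    def endElement(self, tag):
        if tag == "name":
            self.names.append("".join(self.buffer))
            self.in_name = False


# SAX: events are pushed to the handler while the input is read.
handler = NameCollector()
xml.sax.parseString(XML, handler)
sax_names = handler.names

# DOM: the whole tree is built first, then queried in any order.
dom = minidom.parseString(XML)
dom_names = [n.firstChild.data for n in dom.getElementsByTagName("name")]

print(sax_names)
print(dom_names)
```

The DOM half is a single expression once the tree exists, while the SAX half needs explicit state (in_name, buffer) but never holds more than the current name in memory.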

5.2 Parsing XML in Python

Python offers several libraries for XML parsing.

• Libraries:
o xml.etree.ElementTree (ET): A built-in module providing a simple, lightweight,
and "Pythonic" API for parsing and creating XML. Good for basic XML tasks. 4
Represents the XML as a tree of Element objects. 77 Supports basic XPath
expressions. 8 Includes iterparse for memory-efficient processing of large files. 8

o lxml: A powerful third-party library (requires installation: pip install lxml) built on
top of the C libraries libxml2 and libxslt. 78 Generally faster and more feature-rich
than ElementTree. 71 Offers excellent support for XPath 1.0, CSS selectors,
XSLT, and schema validation (XSD). 4 Also provides an ElementTree-compatible
API and iterparse. 4 Often the recommended choice for complex XML processing
or performance-critical applications.

o xml.dom.minidom: A built-in module providing a W3C DOM Level 1
implementation. 35 Parses the entire document into a DOM tree. Generally more
memory-intensive and less Pythonic than ElementTree. 71

o xml.sax: A built-in module providing a SAX 2 interface. 35 Requires implementing
event handlers. Used for low-memory streaming processing.

• ElementTree Parsing Example:
Python
import xml.etree.ElementTree as ET

try:
    # Parse from file
    tree = ET.parse('[Link]')
    root = tree.getroot()  # Get the root <contacts> element

    # Or parse from an XML string:
    # xml_string = "<contacts>...</contacts>"
    # root = ET.fromstring(xml_string)

    print(f"Root element tag: {root.tag}")

    # Iterate through direct children (<contact> elements)
    for contact in root:  # Direct iteration over children
        contact_id = contact.get('id')  # Get attribute value
        name = contact.find('name').text  # Find first 'name' child and get text
        email = contact.findtext('email')  # Shortcut for find().text

        print(f"\nContact ID: {contact_id}")
        print(f" Name: {name}")
        print(f" Email: {email}")

        # Find all 'phone' children of the current contact
        for phone in contact.findall('phone'):
            phone_number = phone.text
            phone_type = phone.get('type')
            print(f" Phone: {phone_number} (Type: {phone_type})")

except FileNotFoundError:
    print("Error: [Link] not found.")
except ET.ParseError as e:
    print(f"Error parsing XML: {e}")

(Combines concepts from ) 4

• lxml Parsing Example (using XPath):


Python
# Requires: pip install lxml
from lxml import etree

try:
    # Parse from file
    tree = etree.parse('[Link]')
    root = tree.getroot()

    # Or parse from string:
    # xml_string = b"<contacts>...</contacts>"  # lxml often prefers bytes
    # root = etree.fromstring(xml_string)

    print(f"Root element tag: {root.tag}")

    # Use XPath to select all contact elements
    contacts = root.xpath('/contacts/contact')  # Absolute path

    # Alternative: find contacts with a specific ID
    # contacts = root.xpath("//contact[@id='c001']")

    for contact in contacts:
        # Use relative XPath expressions from the contact node.
        # xpath() returns a list for node selections, so string(...) is used
        # here to get a plain string (empty if nothing matches).
        contact_id = contact.xpath('string(@id)')  # Attribute value
        name = contact.xpath('string(./name)')  # Text content of <name>
        email = contact.xpath('string(./email)')
        print(f"\nContact ID: {contact_id}")
        print(f" Name: {name}")
        print(f" Email: {email}")

        # Select phone numbers using XPath
        phones = contact.xpath('./phone')
        for phone in phones:
            phone_number = phone.xpath('string(.)')
            phone_type = phone.xpath('string(@type)')
            print(f" Phone: {phone_number} (Type: {phone_type})")

except OSError:  # lxml raises OSError when the file cannot be read
    print("Error: [Link] not found.")
except etree.XMLSyntaxError as e:  # lxml specific parse error
    print(f"Error parsing XML: {e}")

(Combines concepts from ) 4

• Iterative Parsing (iterparse): For very large files, ET.iterparse() or
etree.iterparse() can be used. The key is to process the element during an
'end' event and then call elem.clear() and potentially remove the element from its
parent to reclaim memory. 35
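
A minimal sketch of this pattern with the built-in ElementTree is shown below. The function name count_mobile_phones and the in-memory sample are hypothetical; with a real large file, the path or an open file object would be passed instead of the io.BytesIO wrapper.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data modelled on the contacts example.
SAMPLE = b"""<contacts>
  <contact id="c001">
    <phone type="mobile">123-456-7890</phone>
  </contact>
  <contact id="c002">
    <phone type="work">987-654-3210</phone>
    <phone type="mobile">555-555-5555</phone>
  </contact>
</contacts>"""


def count_mobile_phones(source):
    """Count <phone type="mobile"> elements without keeping whole contacts in memory."""
    count = 0
    # Only 'end' events are requested: an element is complete when its
    # end tag has been read, so its children and text are available.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "contact":
            count += sum(
                1 for p in elem.findall("phone") if p.get("type") == "mobile"
            )
            # Drop the contact's children, text and attributes.
            # The now-empty <contact> element itself stays attached to the root.
            elem.clear()
    return count


mobile_count = count_mobile_phones(io.BytesIO(SAMPLE))
print(mobile_count)
```

Clearing only at the contact level keeps the code simple: each contact's subtree is complete when it is processed and is discarded immediately afterwards. For extremely long documents the empty contact shells left under the root may also need to be removed, which is the "remove the element from its parent" step mentioned above.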

5.3 Generating XML in Python

Both ElementTree and lxml provide convenient ways to construct XML structures
programmatically.

• ElementTree Generation Example:


Python
import xml.etree.ElementTree as ET

# Create the root element
root = ET.Element('orders')
root.set('xmlns', '[Link]')  # Example namespace

# Create the first order element
order1 = ET.SubElement(root, 'order')
order1.set('orderId', 'o101')  # Add attribute

# Add child elements to order1
item1 = ET.SubElement(order1, 'item')
item1.set('itemId', 'i50')
item1.text = 'Laptop'  # Add text content

qty1 = ET.SubElement(order1, 'quantity')
qty1.text = '1'

# Create the second order using alternative attribute syntax
order2 = ET.SubElement(root, 'order', {'orderId': 'o102', 'status': 'pending'})

item2 = ET.SubElement(order2, 'item', {'itemId': 'i60'})
item2.text = 'Keyboard'

qty2 = ET.SubElement(order2, 'quantity')
qty2.text = '2'

# Wrap the root element in an ElementTree object for writing
tree = ET.ElementTree(root)

# Pretty-print in place; ET.indent() is available in Python 3.9+
try:
    ET.indent(root, space=" ", level=0)
except AttributeError:
    print("Note: ET.indent requires Python 3.9+ for pretty printing.")

# Write to file with XML declaration and UTF-8 encoding
tree.write('output_orders.xml', encoding='utf-8', xml_declaration=True)

print("Generated XML file: output_orders.xml")

(Combines concepts from ) 4

• lxml Generation Example: (API is very similar to ElementTree)

Python
# Requires: pip install lxml
from lxml import etree

# Create the root element with namespace
# Note: lxml handles namespaces more explicitly
NSMAP = {"o": "[Link]"}  # Namespace map
root = etree.Element("{[Link]}orders", nsmap=NSMAP)

# Create the first order element
order1 = etree.SubElement(root, "{[Link]}order")
order1.set('orderId', 'o101')  # Attributes don't need namespace prefix here

# Add child elements to order1
item1 = etree.SubElement(order1, "{[Link]}item")
item1.set('itemId', 'i50')
item1.text = 'Laptop'

qty1 = etree.SubElement(order1, "{[Link]}quantity")
qty1.text = '1'

# Create the second order (attributes passed as keyword arguments)
order2 = etree.SubElement(root, "{[Link]}order", orderId='o102',
                          status='pending')

item2 = etree.SubElement(order2, "{[Link]}item", itemId='i60')
item2.text = 'Keyboard'

qty2 = etree.SubElement(order2, "{[Link]}quantity")
qty2.text = '2'

# Create tree object (optional for writing with lxml)
tree = etree.ElementTree(root)

# Write to file with pretty printing, XML declaration, and encoding
tree.write('output_orders_lxml.xml',
           pretty_print=True,
           xml_declaration=True,
           encoding='utf-8')

# Alternatively, get the serialized bytes and write them manually
# xml_string = etree.tostring(root, pretty_print=True, xml_declaration=True, encoding='utf-8')
# with open('output_orders_lxml.xml', 'wb') as f:
#     f.write(xml_string)

print("Generated XML file: output_orders_lxml.xml")

(Combines concepts from [4])

5.4 Parsing XML in Java (JAXP)

The Java API for XML Processing (JAXP) is the standard Java API for handling XML,
supporting DOM, SAX, StAX, and XSLT transformations [70]. It provides an abstraction
layer allowing different underlying parser implementations (like Xerces) to be plugged
in [70].

• DOM Parsing Example: Uses DocumentBuilderFactory and DocumentBuilder to
create an in-memory Document (DOM tree) [36].

Java
import javax.xml.parsers.*;
import org.w3c.dom.*;
import java.io.File;

public class JavaDOMParser {

    public static void main(String[] args) {
        try {
            File inputFile = new File("[Link]");

            // 1. Get Factory and Builder
            DocumentBuilderFactory dbFactory =
                    DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();

            // 2. Parse File to Document (DOM Tree)
            Document doc = dBuilder.parse(inputFile);

            // Optional: Normalize the tree (merges adjacent text nodes, removes empty ones)
            doc.getDocumentElement().normalize();

            System.out.println("Root element: " +
                    doc.getDocumentElement().getNodeName());

            // 3. Get NodeList of contacts
            NodeList contactList = doc.getElementsByTagName("contact");

            // 4. Iterate through NodeList
            for (int i = 0; i < contactList.getLength(); i++) {
                Node contactNode = contactList.item(i);

                if (contactNode.getNodeType() == Node.ELEMENT_NODE) {
                    Element contactElement = (Element) contactNode;

                    // 5. Access Attributes
                    String id = contactElement.getAttribute("id");

                    // 6. Access Child Element Text Content
                    //    (assumes each contact has a name and an email child)
                    String name =
                            contactElement.getElementsByTagName("name").item(0).getTextContent();
                    String email =
                            contactElement.getElementsByTagName("email").item(0).getTextContent();

                    System.out.printf("\nContact ID: %s%n", id);
                    System.out.printf(" Name: %s%n", name);
                    System.out.printf(" Email: %s%n", email);

                    // Access multiple phone elements
                    NodeList phoneList =
                            contactElement.getElementsByTagName("phone");
                    for (int j = 0; j < phoneList.getLength(); j++) {
                        Element phoneElement = (Element) phoneList.item(j);
                        String number = phoneElement.getTextContent();
                        String type = phoneElement.getAttribute("type");
                        System.out.printf(" Phone: %s (Type: %s)%n", number, type);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace(); // Handle parsing/IO exceptions
        }
    }
}

(Based on concepts from [36])

• SAX Parsing Example: Uses SAXParserFactory, SAXParser, and a custom class
extending DefaultHandler to react to parsing events [70].

Java
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import java.io.File;

public class JavaSAXParser {

    public static void main(String[] args) {
        try {
            File inputFile = new File("[Link]");

            // 1. Get Factory and Parser
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();

            // 2. Create Handler instance
            ContactHandler handler = new ContactHandler();

            // 3. Parse File with Handler
            saxParser.parse(inputFile, handler);

        } catch (Exception e) {
            e.printStackTrace(); // Handle parsing/IO exceptions
        }
    }
}

// 4. Custom Handler extending DefaultHandler
class ContactHandler extends DefaultHandler {

    // State variables to track context
    boolean bName = false;
    boolean bEmail = false;
    boolean bPhone = false;
    String currentContactId = null;
    String currentPhoneType = null;
    // Use StringBuilder for efficiency; initialized so characters() never sees null
    StringBuilder elementContent = new StringBuilder();

    @Override
    public void startDocument() throws SAXException {
        System.out.println("Parsing started...");
    }

    @Override
    public void endDocument() throws SAXException {
        System.out.println("Parsing finished.");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes)
            throws SAXException {
        elementContent = new StringBuilder(); // Reset content buffer for new element

        if (qName.equals("contact")) {
            currentContactId = attributes.getValue("id"); // Get attribute value
            System.out.printf("\nStart Contact (ID: %s)%n", currentContactId);
        } else if (qName.equals("name")) {
            bName = true;
        } else if (qName.equals("email")) {
            bEmail = true;
        } else if (qName.equals("phone")) {
            bPhone = true;
            currentPhoneType = attributes.getValue("type"); // Get phone type attribute
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (bName) {
            System.out.println(" Name: " + elementContent.toString());
            bName = false;
        } else if (bEmail) {
            System.out.println(" Email: " + elementContent.toString());
            bEmail = false;
        } else if (bPhone) {
            System.out.printf(" Phone: %s (Type: %s)%n", elementContent.toString(),
                    currentPhoneType);
            bPhone = false;
            currentPhoneType = null; // Reset phone type
        }

        if (qName.equals("contact")) {
            System.out.println("End Contact");
            currentContactId = null; // Reset contact ID
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Append character data to the buffer; the parser may deliver one
        // text node across several calls
        elementContent.append(ch, start, length);
    }

    @Override
    public void warning(SAXParseException e) throws SAXException {
        System.err.println("Warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        System.err.println("Error: " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        System.err.println("Fatal Error: " + e.getMessage());
        throw e; // Re-throw fatal errors
    }
}

(Based on concepts from [70])

5.5 Generating XML in Java

Java offers several ways to generate XML, including using the DOM API or higher-level
libraries like JAXB.

• DOM Generation Example: Involves using DocumentBuilder to create a
Document, then creating and appending Element, Attr, and Text nodes. Finally, a
Transformer is used to serialize the DOM tree to a file or stream [36].

Java
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.*;
import java.io.File;

public class JavaDOMGenerator {

    public static void main(String[] args) {
        try {
            // 1. Get Factory and Builder
            DocumentBuilderFactory dbFactory =
                    DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();

            // 2. Create a new Document
            Document doc = dBuilder.newDocument();

            // 3. Create Root Element
            Element rootElement = doc.createElement("orders");
            doc.appendChild(rootElement);

            // --- Create Order 1 ---
            Element order1 = doc.createElement("order");
            rootElement.appendChild(order1);

            // Set attribute on order1
            Attr orderId1 = doc.createAttribute("orderId");
            orderId1.setValue("o101");
            order1.setAttributeNode(orderId1);
            // Alternative: order1.setAttribute("orderId", "o101");

            // Create item element for order1
            Element item1 = doc.createElement("item");
            item1.setAttribute("itemId", "i50");
            item1.appendChild(doc.createTextNode("Laptop")); // Add text content
            order1.appendChild(item1);

            // Create quantity element for order1
            Element qty1 = doc.createElement("quantity");
            qty1.appendChild(doc.createTextNode("1"));
            order1.appendChild(qty1);

            // --- Create Order 2 (similar process) ---
            Element order2 = doc.createElement("order");
            rootElement.appendChild(order2);
            order2.setAttribute("orderId", "o102");
            //... add item and quantity for order 2...

            // 4. Write the DOM document to an XML file
            TransformerFactory transformerFactory =
                    TransformerFactory.newInstance();
            Transformer transformer = transformerFactory.newTransformer();

            // Configure for pretty printing (indentation)
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            // Optional: specific indent amount
            transformer.setOutputProperty("{[Link]}indent-amount", "2");

            DOMSource source = new DOMSource(doc);
            StreamResult result = new StreamResult(new
                    File("output_orders_java_dom.xml"));

            transformer.transform(source, result);

            System.out.println("Generated XML file: output_orders_java_dom.xml");

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

(Based on DOM concepts and standard Java XML transformation [36])

• JAXB (Java Architecture for XML Binding):
o Concept: A higher-level API primarily focused on mapping XML data to and from
Java objects (POJOs - Plain Old Java Objects) [36]. It simplifies XML processing
by allowing developers to work directly with Java objects instead of low-level
DOM nodes or SAX events [36].
o Process for Generation (Marshalling):
1. Define Java Classes: Create Java classes that represent the structure of
the desired XML.
2. Annotate Classes: Use JAXB annotations (from
javax.xml.bind.annotation.* or jakarta.xml.bind.annotation.*, depending on
whether the legacy javax or the newer Jakarta release of JAXB is in use) like
@XmlRootElement, @XmlElement, @XmlAttribute, @XmlType, and
@XmlElementWrapper to map Java fields/properties to XML
elements/attributes [76]. Alternatively, generate classes from an existing XSD
using the xjc tool [36].
3. Create Java Objects: Instantiate and populate these Java objects with the
data to be written to XML.
4. Create JAXBContext: Obtain a JAXBContext instance for the root
class(es) involved (JAXBContext.newInstance(Orders.class)) [36].
5. Create Marshaller: Create a Marshaller object from the JAXBContext
(context.createMarshaller()) [36].
6. Configure Marshaller (Optional): Set properties like
Marshaller.JAXB_FORMATTED_OUTPUT to true for pretty printing [76].
7. Marshal: Call the marshaller.marshal(javaObject, outputTarget) method,
where javaObject is the root Java object and outputTarget can be a File,
OutputStream, Writer, etc. [36].

o Example (Conceptual - requires annotated classes Orders, Order, Item):

Java
import javax.xml.bind.*; // Or jakarta.xml.bind.* in newer Java versions
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Assume these classes are defined and annotated appropriately
// @XmlRootElement(name="orders")
// class Orders {
//     @XmlElement(name="order") List<Order> orderList = new ArrayList<>();
//     // Getters/Setters...
// }
// @XmlType(propOrder = { "item", "quantity" }) // Example ordering
// class Order {
//     @XmlAttribute String orderId;
//     @XmlElement Item item;
//     @XmlElement int quantity;
//     // Getters/Setters...
// }
// class Item {
//     @XmlAttribute String itemId;
//     @XmlValue String value; // For text content
//     // Getters/Setters...
// }

public class JavaJAXBGenerator {

    public static void main(String[] args) {
        try {
            // 1. Create Java Objects
            Orders orders = new Orders();
            orders.orderList = new ArrayList<>();

            Order order1 = new Order();
            order1.orderId = "o101";
            order1.quantity = 1;
            Item item1 = new Item();
            item1.itemId = "i50";
            item1.value = "Laptop";
            order1.item = item1;
            orders.orderList.add(order1);

            Order order2 = new Order();
            order2.orderId = "o102";
            order2.quantity = 2;
            Item item2 = new Item();
            item2.itemId = "i60";
            item2.value = "Keyboard";
            order2.item = item2;
            orders.orderList.add(order2);

            // 2. Create JAXB Context and Marshaller
            JAXBContext context = JAXBContext.newInstance(Orders.class); // Context for root class
            Marshaller marshaller = context.createMarshaller();

            // 3. Configure for pretty printing
            marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);

            // 4. Marshal to XML File
            File outputFile = new File("output_orders_java_jaxb.xml");
            marshaller.marshal(orders, outputFile); // Marshal the root object

            System.out.println("Generated XML file: " + outputFile.getName());

        } catch (JAXBException e) {
            // Handle JAXB-specific exceptions
            e.printStackTrace();
        }
    }
}

(Based on concepts from [36])

5.6 Module Insights & Connections

The availability of multiple parsing (DOM, SAX, StAX, iterparse) and generation (DOM,
JAXB, ElementTree/lxml) techniques within standard libraries or popular third-party
packages for languages like Python and Java underscores XML's maturity and
continued relevance as a data format [35]. While newer formats like JSON have gained
popularity for specific use cases (especially web APIs), the robust tooling for XML
reflects its deep integration into enterprise systems, configuration management,
document markup, and established data exchange protocols. Libraries like lxml in
Python and JAXB in Java provide high-level abstractions that significantly simplify
common XML processing tasks for developers [8]. They encapsulate much of the
complexity of raw DOM manipulation or SAX event handling, allowing developers to
focus more on the application logic rather than the intricacies of XML parsing or
serialization. However, an understanding of the underlying models (DOM's tree
structure, SAX's event stream) remains valuable for optimizing performance, managing
memory efficiently (especially with large files), and troubleshooting issues that may
arise when using these higher-level tools [35].

5.7 Practical Project Idea: XML-based Config File Reader/Writer

Task: Develop a simple application configuration manager using XML.

1. Design XML Format: Define an XML structure for storing configuration settings.
For example:
XML
<configuration>
  <database type="mysql">
    <host>localhost</host>
    <port>3306</port>
    <username>user</username>
    <password encrypted="false">pass123</password>
  </database>
  <api service="weather">
    <url>[Link]</url>
    <key>your_api_key</key>
  </api>
  <ui>
    <theme>dark</theme>
    <fontSize>12</fontSize>
  </ui>
</configuration>

2. Choose Language and Library: Select either Python (with ElementTree or lxml)
or Java (with JAXP DOM or JAXB).
3. Implement Reader: Write code to parse the [Link] file. Implement
functions/methods to:
o getSetting(xpath_or_key): Retrieves the value of a specific setting (e.g., get
database host, get weather API key). Using XPath can make this flexible.
o getSection(section_name): Retrieves all settings within a section (e.g., get all
database settings).
4. Implement Writer: Write code to modify the configuration data structure in
memory and save it back to [Link]. Implement functions/methods to:
o setSetting(xpath_or_key, value): Modifies an existing setting or adds a new one.
Handle creating parent elements if they don't exist.
o saveConfiguration(): Writes the current configuration state back to the XML file,
preserving the structure and potentially using pretty printing.
5. Testing: Create a sample [Link], run the program to read settings, modify a
few values (e.g., change the theme, update the password), add a new setting (e.g., a
new API key), and verify that the [Link] file is correctly updated upon saving.
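
As a starting point for the project, the reader and writer operations listed above might be sketched in Python with ElementTree as follows. Everything here is illustrative: the class name XmlConfig, its method names, and the trimmed sample document are assumptions made for this sketch, and settings are addressed with slash-separated paths (the small XPath subset that ElementTree understands) rather than full XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, a trimmed version of the format from step 1.
SAMPLE_CONFIG = """<configuration>
  <database type="mysql">
    <host>localhost</host>
    <port>3306</port>
  </database>
  <ui>
    <theme>dark</theme>
    <fontSize>12</fontSize>
  </ui>
</configuration>"""


class XmlConfig:
    """Tiny in-memory configuration manager (illustrative names only)."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def get_setting(self, path):
        """Return the text at a slash-separated path such as 'ui/theme', or None."""
        return self.root.findtext(path)

    def get_section(self, name):
        """Return all direct child settings of a section as a dict."""
        section = self.root.find(name)
        return {} if section is None else {child.tag: child.text for child in section}

    def set_setting(self, path, value):
        """Set a value, creating any missing parent elements along the way."""
        node = self.root
        for tag in path.split("/"):
            child = node.find(tag)
            if child is None:
                child = ET.SubElement(node, tag)
            node = child
        node.text = str(value)

    def save(self, filename):
        """Write the current state to a file with an XML declaration."""
        ET.indent(self.root)  # pretty printing; requires Python 3.9+
        ET.ElementTree(self.root).write(filename, encoding="utf-8", xml_declaration=True)


config = XmlConfig(SAMPLE_CONFIG)
config.set_setting("ui/theme", "light")        # modify an existing setting
config.set_setting("api/key", "your_api_key")  # add a new section and setting
print(config.get_setting("ui/theme"))
print(config.get_section("database"))
```

Calling config.save(...) with a file name of your choice then covers the saveConfiguration step. Attribute handling (such as type="mysql" or encrypted="false") is deliberately left out and is a natural extension for the project.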

Module 6: XML and Databases


As XML became prevalent for data representation and exchange, the need arose to
store and manage XML documents persistently within database systems. This presents
challenges due to the potential mismatch between XML's hierarchical, semi-structured
nature and the tabular model of traditional Relational Database Management Systems
(RDBMS) [72].

6.1 Storing XML: Challenges and Approaches

Storing XML effectively requires balancing the need to preserve the document's
structure and semantics with the database's capabilities for efficient storage, querying,
and updating. The main challenges include handling XML's flexible schema, nested
structures, and potential for mixed content within the constraints of a chosen database
model [72].

Two primary strategies have emerged:

1. XML-Enabled Relational Databases: Leveraging existing RDBMS infrastructure
by adding features to store and query XML data [72]. This includes methods like
storing XML in large object types (CLOBs/BLOBs), decomposing XML into
relational tables (shredding), or using dedicated XML-specific data types (like
XMLType).
2. Native XML Databases (NXDs): Databases designed specifically to manage XML
documents as their fundamental data unit, often using an internal model based on
the XML structure itself (like DOM) [72].

6.2 Relational Database Storage Methods

RDBMS vendors have implemented several techniques to accommodate XML data:

• CLOB/BLOB Storage:
o Concept: The simplest approach involves storing the entire XML document as
opaque data within a Character Large Object (CLOB) or Binary Large Object
(BLOB) column in a relational table [72].
o Pros: Easy to implement; guarantees preservation of the exact original
document, including formatting and comments (especially CLOBs) [72]; can store
any XML document regardless of structure or validity [72].
o Cons: Highly inefficient for querying or updating specific parts of the XML
content, as the entire object must typically be retrieved and parsed by the
application or database functions for each operation [72]. Performance degrades
significantly with document size [72]. Indexing capabilities on the XML content itself
are very limited.

• Shredding (Object-Relational Mapping):
o Concept: This method involves parsing the XML document and decomposing its
structure and content into a set of predefined relational tables [72]. An XML
Schema often guides this mapping process [72]. For example, a <customer>
element might map to a Customers table, and its child <order> elements might
map to an Orders table with a foreign key linking back to Customers.
o Pros: Allows the use of standard SQL for querying the decomposed data;
leverages mature RDBMS features like indexing, transactions, and referential
integrity on the relational tables [72]; can be efficient for highly structured,
data-centric XML that maps well to tables [72].
o Cons: The mapping process can be complex to design and implement [72]; loss of
the original XML document structure and hierarchy (DOM fidelity) is common [72];
reconstructing the original XML from the relational tables can be
computationally expensive, often requiring complex SQL joins [72]; inflexible to
changes in the XML schema, as changes may require altering the relational
schema [72]; performs poorly for document-centric XML, deeply nested structures,
or XML with mixed content [72].

• XMLType Columns (Hybrid Approach):
o Concept: Many major RDBMS (e.g., Oracle, SQL Server, PostgreSQL with
JSON/XML types) offer dedicated data types specifically designed for storing
XML within a relational table [72]. The database manages the internal storage,
which might be CLOB-based, binary XML, or an object-relational shredded
representation, often chosen based on whether an XML Schema is associated
with the column [85].
o Pros: Provides a middle ground, often allowing schema registration and
validation [72]; enables querying directly against the XML content using XPath
and/or XQuery embedded within SQL statements [84]; supports specialized XML
indexes (on structure or values) for improved query performance compared to
raw CLOBs [86]; can offer better preservation of the XML information model than
pure shredding [87].
o Cons: Specific features, performance characteristics, and internal storage
mechanisms vary significantly between database vendors; performance is
highly dependent on the chosen internal storage (CLOB vs. binary vs.
shredded) and the effectiveness of XML indexing [85]; schema mapping and
annotation can still be complex for intricate XML structures [72]; while the logical
XML structure is usually preserved, physical details like whitespace or
comments might be lost unless stored internally as CLOB [85].
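
To make the contrast between the first two methods concrete, the following minimal sketch uses Python's built-in sqlite3 module, with a TEXT column standing in for a CLOB. The document, the table names (xml_docs, customers, orders), and the column names are assumptions made for illustration; a production RDBMS would use its own large-object type and a schema-driven mapping.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document and table names, chosen only for illustration.
DOC = """<customer id="c1"><name>Ada</name>
  <order id="o101"><total>1200.00</total></order>
  <order id="o102"><total>45.50</total></order>
</customer>"""

conn = sqlite3.connect(":memory:")

# Approach 1: store the whole document as opaque text (TEXT standing in for a CLOB).
conn.execute("CREATE TABLE xml_docs (doc_id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO xml_docs (body) VALUES (?)", (DOC,))

# Looking inside the document means fetching and parsing it in the application.
(body,) = conn.execute("SELECT body FROM xml_docs WHERE doc_id = 1").fetchone()
clob_totals = [float(o.findtext("total")) for o in ET.fromstring(body).findall("order")]

# Approach 2: shred the document into relational tables.
conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, "
    "customer_id TEXT REFERENCES customers(id), total REAL)"
)
customer = ET.fromstring(DOC)
conn.execute(
    "INSERT INTO customers VALUES (?, ?)",
    (customer.get("id"), customer.findtext("name")),
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(o.get("id"), customer.get("id"), float(o.findtext("total")))
     for o in customer.findall("order")],
)

# Once shredded, plain SQL answers the question without parsing any XML.
shredded_ids = [row[0] for row in
                conn.execute("SELECT id FROM orders WHERE total > 100 ORDER BY id")]

print(clob_totals)
print(shredded_ids)
```

The sketch also shows the trade-off named in the lists above: the xml_docs row still holds the exact original text, while the shredded tables have discarded whitespace and element order and would need joins to rebuild the document.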

6.3 Native XML Databases (NXD)

Native XML Databases are architected specifically for managing XML data.

• Concept: NXDs treat the XML document (or a logical fragment like a subtree
rooted at an element) as the fundamental unit of storage and retrieval [72]. They
operate directly on an XML data model, often conceptually similar to the DOM,
rather than mapping XML to a different model like relational tables [72].

• Pros:
o Natural Fit: Designed inherently for hierarchical, semi-structured XML data,
handling variations and complex nesting naturally [84].
o Fidelity: Preserve the full XML structure and hierarchy (DOM fidelity) [72].
o Querying: Provide native, efficient support for XML query languages like XPath
and XQuery [82]. Querying based on document structure is typically very efficient [72].
o Schema Flexibility: Often allow storage of XML without a predefined schema, or
handle evolving schemas more gracefully than relational shredding
approaches [72]. Schema validation is typically still supported if required.
o Indexing: Offer specialized indexing techniques optimized for XML structures,
paths, values, and often include integrated full-text search capabilities [82].
o Suitability: Particularly well-suited for document-centric applications (e.g.,
content management, digital libraries, scientific data) or data with irregular or
unpredictable structures [72].

• Cons:
o Maturity/Features: May lag behind top-tier RDBMS in certain enterprise features
like transaction models, complex security controls, or breadth of administrative
tooling, although many NXDs now offer robust capabilities in these areas [82].
o SQL Integration: Support for standard SQL querying might be limited or
non-existent, requiring developers to learn XPath/XQuery [82].
o Ecosystem: Smaller market share compared to RDBMS, which might affect the
availability of third-party tools, experienced developers, and community support.
o Cost: Some high-performance commercial NXDs can have significant licensing
costs [82].

• Examples: Notable NXDs include MarkLogic Server (commercial) [82], eXist-db
(open source) [82], and BaseX (open source) [88].

6.4 Comparison of Storage Approaches


Choosing the right storage method depends on the specific characteristics of the XML
data and the application's requirements.

| Criterion | CLOB/BLOB Storage | Shredding (Object-Relational) | XMLType Columns (RDBMS) | Native XML Database (NXD) |
|---|---|---|---|---|
| Data Model Fit | Any XML; poor fit for querying | Best for structured, data-centric XML | Hybrid; good with schema support | Best for semi-structured, document-centric XML |
| Query Language | Application parsing; limited SQL/text | Standard SQL on relational tables | SQL + embedded XPath/XQuery | Native XPath/XQuery |
| Query Performance | Poor for specific parts | Good for relational queries; poor for XML reconstruction | Variable; better than CLOB with indexes | Generally good for structural queries |
| Update Performance | Poor (whole object rewrite) | Good for targeted relational updates | Variable; potentially better than CLOB | Good for targeted XML updates |
| Large Docs Performance | Poor | Can be good if mapping is efficient | Variable (depends on internal storage) | Generally good (designed for XML) |
| Schema Flexibility | High (stores anything) | Low (tied to relational schema) | Medium (can store without schema; changes may be hard) | High (often schema-agnostic or flexible) |
| Data Integrity | None inherent to XML content | High (via RDBMS constraints) | Medium (via optional XML Schema validation) | Medium (via optional XML Schema validation) |
| XML Fidelity | High (exact copy, esp. CLOB) | Low (structure lost) | Medium (logical structure preserved, physical may vary) | High (logical structure preserved) |
| Ease of Implementation | Simple | Complex mapping required | Moderately complex (vendor-specific) | Moderate (requires learning NXD concepts) |

(Table data synthesized from [72])

6.5 Module Insights & Connections

The dichotomy between storing XML in relational systems versus native XML databases
illustrates a classic technological dilemma: adapt the established, ubiquitous technology
(RDBMS) to handle a new data format, or develop a new technology (NXD) specifically
tailored to that format [72]. RDBMS approaches, like shredding or CLOB storage,
fundamentally attempt to fit XML within the relational paradigm, often leading to
compromises in terms of performance, fidelity, or flexibility [72]. Native XML databases,
conversely, build their architecture around the XML model itself, prioritizing efficient
handling of hierarchical structures and XML-specific querying, but potentially sacrificing
some of the maturity or broad ecosystem of RDBMS [72].

However, the landscape is not static. The development and enhancement of XMLType
data types and associated features (like XML indexing and integrated XPath/XQuery
support) within major relational databases represent a significant evolution [85]. This trend
indicates that RDBMS vendors recognized the shortcomings of earlier approaches
(simple CLOBs or pure shredding) and are actively working to incorporate more
sophisticated, native-like XML processing capabilities directly into their platforms. This
ongoing development narrows the gap between the relational and XML worlds: it blurs
the line between XML-enabled RDBMS and native XML databases and gives users more
viable options for managing XML data within familiar relational environments.

6.6 Practical Idea: Scenario Discussion

Task: Divide students into groups and present them with the following scenarios. Each
group should discuss the pros and cons of storing the described XML data using (a)
CLOB/BLOB in RDBMS, (b) Shredding to RDBMS, (c) XMLType in RDBMS, and (d) a
Native XML Database. They should recommend an approach for each scenario and
justify their choice.

• Scenario 1: Product Catalog: An e-commerce company stores product
information (ID, name, description, price, category, supplier details) in XML format.
The structure is relatively stable, and frequent queries involve searching for
products by price range, category, or name. Updates mainly involve changing
prices or adding new products. SQL-based reporting is essential.
• Scenario 2: Medical Records Archive: A hospital archives patient discharge
summaries as XML documents. These documents have a complex, deeply nested
structure defined by an industry standard (like HL7 CDA), but individual records
may omit optional sections. The primary need is long-term archival and occasional
retrieval of specific patient summaries based on patient ID or date. Complex
queries across the content of many records are rare, but preserving the exact
original document structure is important for compliance.
• Scenario 3: Configuration Files: A software application uses XML files to store its
configuration settings, including nested sections for database connections, UI
preferences, and plugin settings. The application reads the entire configuration on
startup and occasionally updates specific values. The structure might evolve slightly
between application versions.
• Scenario 4: Digital Library: A university library digitizes historical manuscripts,
encoding them in TEI (Text Encoding Initiative) XML. These documents contain rich
markup for text structure, annotations, and metadata. Researchers need to perform
complex structural queries (e.g., find all annotations within chapter headings) and
full-text searches across the entire collection. The XML structure is complex and
document-centric.

Module 7: Queries on XML Documents


Once XML data is stored or received, extracting specific information requires
specialized query languages. While simple navigation can be done programmatically
using DOM or iterative parsing, dedicated query languages offer more power and
declarative syntax. The two primary standards are XPath and XQuery [51].

7.1 Introduction to XML Querying

XML's structure, being hierarchical rather than tabular, necessitates query languages
different from SQL used for relational databases. These languages need to navigate the
tree structure and select nodes (elements, attributes, text) based on their path, names,
values, and relationships to other nodes [52]. XPath provides the foundation for path
navigation, while XQuery builds upon it to offer a more complete, functional query
language [55].

7.2 XPath (XML Path Language) Deep Dive

XPath is a language for addressing and selecting parts of an XML document [46]. It uses a
path-based syntax, similar to file system paths, to navigate the XML tree [51]. XPath is not
typically used standalone but is a crucial component within other technologies like
XSLT, XQuery, XML Schema (for key constraints), DOM libraries, and XML databases [4].

• Purpose: To select a set of nodes (a node-set) or compute values (string, number,
boolean) based on the content and structure of an XML document [51].

• Syntax Components:
o Location Path: An expression that selects nodes. It consists of one or more
location steps, separated by /.
o Location Step: Has three parts: an axis, a node test, and zero or more
predicates.
▪ Axis: Specifies the relationship between the selected nodes and the
context node (the node from which the step is evaluated) [51]. Key axes
include:
    ▪ child: Selects children (default axis if none specified) [91].
    ▪ attribute (abbr: @): Selects attributes [52].
    ▪ parent (abbr: ..): Selects the parent [52].
    ▪ self (abbr: .): Selects the context node itself [52].
    ▪ descendant: Selects descendants at any level below [91].
    ▪ descendant-or-self: Selects the context node and all descendants. The
    abbreviation // is shorthand for /descendant-or-self::node()/ [52].
    ▪ ancestor: Selects parent, grandparent, etc., up to the root [91].
    ▪ ancestor-or-self: Selects context node and all ancestors [91].
    ▪ following-sibling: Selects subsequent siblings [91].
    ▪ preceding-sibling: Selects preceding siblings [91].
    ▪ following: Selects all nodes after the context node in document order,
    excluding descendants [91].
    ▪ preceding: Selects all nodes before the context node in document order,
    excluding ancestors [91].
▪ Node Test: Specifies the type or name of the nodes to select along the
chosen axis [52]. Examples:
    ▪ elementName: Selects elements with the specified name.
    ▪ *: Selects all elements along the axis (or all attributes on the
    attribute axis).
    ▪ text(): Selects text nodes.
    ▪ node(): Selects any node of any type.
    ▪ comment(): Selects comment nodes.
    ▪ processing-instruction(): Selects processing instruction nodes.
▪ Predicate: Filters the node-set selected by the axis and node test.
Enclosed in square brackets [...] [52]. Can contain:
    ▪ Numbers for position (e.g., [1], [last()]). Note: XPath positions are
    1-based [52].
    ▪ Boolean expressions involving comparisons (=, !=, <, >, <=, >=),
    functions, or other XPath expressions relative to the node being tested
    (e.g., [@type='mobile'], [price > 50], [starts-with(name, 'A')],
    [count(phone) > 1]) [52].
    ▪ Existence tests (e.g., [email] selects nodes that have at least one email
    child) [52].

• Absolute vs. Relative Paths: Paths starting with / begin at the document's root
node. Paths starting with // begin searching from the root but match anywhere.
Paths not starting with / or // are evaluated relative to the current context node.
• Functions: XPath includes a core library of functions for various operations [50].

o Node Functions: count(), position(), last(), name(), local-name(), `namespace
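
Several of the axes, node tests, and predicates listed above can be tried with Python's standard library. This is a minimal sketch on a hypothetical contacts document; note that xml.etree.ElementTree implements only a subset of XPath 1.0, so axes such as following-sibling and functions such as count() or starts-with() need a full XPath engine like lxml instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for illustration.
root = ET.fromstring("""<contacts>
  <contact id="c001">
    <name>Ada</name>
    <email>ada@example.com</email>
    <phone type="mobile">555-0100</phone>
    <phone type="home">555-0101</phone>
  </contact>
  <contact id="c002">
    <name>Grace</name>
    <phone type="work">555-0200</phone>
  </contact>
</contacts>""")

# child axis (the default): direct <contact> children of the context node
ids = [c.get("id") for c in root.findall("contact")]

# // (descendant-or-self) combined with an attribute predicate
mobile = [p.text for p in root.findall(".//phone[@type='mobile']")]

# existence test: contacts that have at least one <email> child
with_email = [c.get("id") for c in root.findall("contact[email]")]

# positional predicates are 1-based: first and last <phone> of the first contact
first_phone = root.find("contact[1]/phone[1]").text
last_phone = root.find("contact[1]/phone[last()]").text

# parent step (..): the contact that owns the 'work' phone
owner = root.find(".//phone[@type='work']/..").get("id")

print(ids, mobile, with_email, first_phone, last_phone, owner)
```

All paths here are relative to the root element, because ElementTree evaluates expressions from the element that find() or findall() is called on rather than from a document node.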

Works cited

1. XML - MDN Web Docs Glossary: Definitions of Web-related terms - Mozilla,


accessed April 24, 2025, [Link]
2. Extensible Markup Language (XML) 1.0 (Fifth Edition) - W3C, accessed April 24,
2025, [Link]
3. XML introduction - XML: Extensible Markup Language - MDN Web Docs - Mozilla,
accessed April 24, 2025, [Link]
US/docs/Web/XML/Guides/XML_introduction
4. XML Basics with Python | Martian Defense NoteBook, accessed April 24, 2025,
[Link]
python

5. XML Guides - Extensible Markup Language - MDN Web Docs, accessed April 24,
2025, [Link]
6. Semi-structured data - Wikipedia, accessed April 24, 2025,
[Link]
7. Basic HTML syntax - Learn web development | MDN, accessed April 24, 2025,
[Link]
US/docs/Learn_web_development/Core/Structuring_content/Basic_HTML_syntax
8. Python XML Tutorial: Element Tree Parse & Read - DataCamp, accessed April
24, 2025, [Link]
9. XML DTD - W3Schools, accessed April 24, 2025,
[Link]
10. What is semi-structured data? - CrowdStrike, accessed April 24, 2025,
[Link]
structured-data/
11. Structured Data Versus Semi-Structured Data | Snowflake, accessed April 24,
2025, [Link]
data/
12. What is Semi Structured Data? - Sotero, accessed April 24, 2025,
[Link]
13. Unstructured Data: Structured vs Semi-Structured | Starburst, accessed April 24,
2025, [Link]
14. Structured vs. Semi-Structured vs. Unstructured Data in PostgreSQL | Timescale,
accessed April 24, 2025, [Link]
structured-vs-unstructured-data-in-postgresql
15. What Is Structured, Semi-Structured and Unstructured Data? - SingleStore,
accessed April 24, 2025, [Link]
semi-structured-and-unstructured-data/
16. Difference between Structured, Semi-structured and Unstructured data |
GeeksforGeeks, accessed April 24, 2025,
[Link]
and-unstructured-data/

17. Is XML semi structured data? - Stack Overflow, accessed April 24, 2025,
[Link]
18. XSLT: Extensible Stylesheet Language Transformations - XML - MDN Web Docs,
accessed April 24, 2025, [Link]
US/docs/Web/XML/XSLT
19. Understanding HTML, XML and XHTML | WebKit, accessed April 24, 2025,
[Link]
20. W3C XML Schema (XSD) Validation online, accessed April 24, 2025,
[Link]
21. Validate an XML document by using DTD, XDR, or XSD - Visual Basic - Learn
Microsoft, accessed April 24, 2025, [Link]
us/troubleshoot/developer/visualstudio/visual-basic/language-compilers/validate-
xml-document-by-dtd-xdr-xsd
22. Difference between dtd and xsd | PDF - SlideShare, accessed April 24, 2025,
[Link]
23. Difference between Document Type Definition (DTD) and XML ..., accessed April
24, 2025, [Link]
definition-dtd-and-xml-schema-definition-xsd/
24. Understanding the Advantages & Disadvantages of DTD in XML - Hurix Digital,
accessed April 24, 2025, [Link]
advantages-and-disadvantages/
25. Describing your Data: DTDs and XML Schemas, accessed April 24, 2025,
[Link]
26. DTD vs XSD | Learn the Difference between DTD and XSD - EDUCBA, accessed
April 24, 2025, [Link]
27. xml - How to choose between DTD and XSD - Stack Overflow, accessed April 24,
2025, [Link]
and-xsd
28. DTD or XML Schema. Which one is better? [closed] - Stack Overflow, accessed
April 24, 2025, [Link]
which-one-is-better

29. What is DTD in XML - GeeksforGeeks, accessed April 24, 2025,
[Link]
30. W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures, accessed
April 24, 2025, [Link]
31. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes, accessed
April 24, 2025, [Link]
32. XML Schema (XSD) Validation with XmlSchemaSet - .NET ..., accessed April 24,
2025, [Link]
validation-with-xmlschemaset
33. DTD versus XSD - Oracle Forums, accessed April 24, 2025,
[Link]
34. XML Validation with XSD in Visual Studio IDE - Stack Overflow, accessed April
24, 2025, [Link]
visual-studio-ide
35. How to Parse XML in Python | ScrapingAnt, accessed April 24, 2025,
[Link]
36. Java Architecture for XML Binding (JAXB) - Oracle, accessed April 24, 2025,
[Link]
37. How to generate xml files from xsd? (Java in General forum at Coderanch),
accessed April 24, 2025, [Link]
xsd
38. CSS: Cascading Style Sheets - MDN Web Docs, accessed April 24, 2025,
[Link]
39. Documentation of HTML and CSS - The freeCodeCamp Forum, accessed April
24, 2025, [Link]
css/551354
40. CSS - Wikipedia, accessed April 24, 2025, [Link]
41. XML in Mozilla - Archive of obsolete content | MDN - Cach3, accessed April 24,
2025, [Link]
US/docs/Archive/Mozilla/XML_in_Mozilla

42. Transforming with XSLT - Web APIs | MDN, accessed April 24, 2025,
[Link]
US/docs/Web/API/Document_Object_Model/Transforming_with_XSLT
43. How to style SVG with external CSS? - Stack Overflow, accessed April 24, 2025,
[Link]
44. An overview - XSLT: Extensible Stylesheet Language Transformations - MDN
Web Docs, accessed April 24, 2025, [Link]
US/docs/Web/XML/XSLT/Guides/Transforming_XML_with_XSLT/An_Overview
45. Selectors Level 4 - W3C, accessed April 24, 2025,
[Link]
46. XML: Extensible Markup Language - MDN Web Docs, accessed April 24, 2025,
[Link]
47. Transforming XML with XSLT - XSLT: Extensible Stylesheet ..., accessed April 24,
2025, [Link]
US/docs/Web/XML/XSLT/Guides/Transforming_XML_with_XSLT
48. XSLT - Wikipedia, accessed April 24, 2025, [Link]
49. XSL Transformations (XSLT) Version 2.0 (Second Edition) - W3C, accessed April
24, 2025, [Link]
50. XPath functions - MDN Web Docs, accessed April 24, 2025,
[Link]
51. XPath - XML: Extensible Markup Language - MDN Web Docs, accessed April 24,
2025, [Link]
52. XPath | Introduction to Path Description Language - IONOS, accessed April 24,
2025, [Link]
tutorial/
53. XPath reference - MDN Web Docs - Mozilla, accessed April 24, 2025,
[Link]
54. DOM scripting introduction - Learn web development | MDN, accessed April 24,
2025, [Link]
US/docs/Learn_web_development/Core/Scripting/DOM_scripting

55. Essential XQuery - The XML Query Language - UJI, accessed April 24, 2025,
[Link]
56. Reg: Difference between XQuery and XPath - Oracle Forums, accessed April 24,
2025, [Link]
and-xpath-6600
57. XHTML - MDN Web Docs Glossary: Definitions of Web-related terms, accessed
April 24, 2025, [Link]
58. Why did XML lose out to XHTML, then HTML 5, on the web?, accessed April 24,
2025, [Link]
lose-out-to-xhtml-then-html-5-on-the-web
59. HTML - MDN Web Docs Glossary: Definitions of Web-related terms, accessed
April 24, 2025, [Link]
60. HTML5 - Wikipedia, accessed April 24, 2025, [Link]
61. 14 The XML syntax - HTML Spec WHATWG, accessed April 24, 2025,
[Link]
62. <html>: The HTML Document / Root element - HTML: HyperText Markup Language -
MDN Web Docs, accessed April 24, 2025, [Link]
US/docs/Web/HTML/Element/html
63. Introduction to the DOM - Web APIs | MDN, accessed April 24, 2025,
[Link]
US/docs/Web/API/Document_Object_Model/Introduction
64. JavaScript - W3C DOM - Introduction - QuirksMode, accessed April 24, 2025,
[Link]
65. Using the Document Object Model - Web APIs - MDN Web Docs, accessed April
24, 2025, [Link]
US/docs/Web/API/Document_Object_Model/Using_the_Document_Object_Model
66. Document Object Model (DOM) - Web APIs - MDN Web Docs, accessed April 24,
2025, [Link]
US/docs/Web/API/Document_Object_Model

67. Using the W3C DOM Level 1 Core - Web APIs, accessed April 24, 2025,
[Link]
W3C_DOM_Level_1_Core
68. Introduction to the DOM - Web APIs, accessed April 24, 2025,
[Link]
69. Document Object Model - Devopedia, accessed April 24, 2025,
[Link]
70. XML Parsing for Java, accessed April 24, 2025, [Link]
documents/oracle/E11882_01/appdev.112/e10708/adx_j_parser.htm
71. python - XML parsing - ElementTree vs SAX and DOM - Stack Overflow,
accessed April 24, 2025, [Link]
elementtree-vs-sax-and-dom
72. [Link], accessed April 24, 2025,
[Link]
73. Parsing an XML File Using SAX (The Java™ Tutorials > Java API for ..., accessed
April 24, 2025, [Link]
74. How do I parse my simple XML file with Java and SAX? - Stack Overflow,
accessed April 24, 2025, [Link]
parse-my-simple-xml-file-with-java-and-sax
75. Trail: Java API for XML Processing (JAXP) (The Java™ Tutorials), accessed April
24, 2025, [Link]
76. JAXP vs JAXB: XML Processing APIs Compared - Baeldung, accessed April 24,
2025, [Link]
77. [Link] — The ElementTree XML API — Python 3.13 ..., accessed
April 24, 2025, [Link]
78. How to Parse XML in Python? Multiple Methods Covered - Bright Data, accessed
April 24, 2025, [Link]
79. Python lxml tutorial: XML processing and web scraping - Apify Blog, accessed
April 24, 2025, [Link]
80. python - How to write XML declaration using [Link] - Stack
Overflow, accessed April 24, 2025,
[Link]
xml-etree-elementtree
81. JAXB Release Documentation - Java EE, accessed April 24, 2025,
[Link]
82. The Advantages/Disadvantages of XML compared to RDMS - Stack Overflow,
accessed April 24, 2025, [Link]
advantages-disadvantages-of-xml-compared-to-rdms
83. Modeling: Xml vs. Relational Database - Stack Overflow, accessed April 24, 2025,
[Link]
84. NATIVE XML DATABASES vs. RELATIONAL DATABASES IN DEALING WITH
XML DOCUMENTS Gordana Pavlovic-Lazetic, accessed April 24, 2025,
[Link]
85. CLOB Vs XMLTYPE - Oracle Forums, accessed April 24, 2025,
[Link]
86. XML Data – To be stored or not to be stored, and beyond… - http, accessed April
24, 2025, [Link]
to-be-stored-and-beyond%E2%80%A6/
87. XML usage in relational databases | Open Textbooks for Hong Kong, accessed
April 24, 2025, [Link]
88. 9 Critical Types of XML Tools for Developers - Sonra, accessed April 24, 2025,
[Link]
89. XPath Quick Reference (XQuery and XSLT Reference Guide) — MarkLogic 9
Product Documentation, accessed April 24, 2025,
[Link]
90. Difference Between XQuery and XPath | GeeksforGeeks, accessed April 24,
2025, [Link]
91. Axes - XPath - MDN Web Docs - Mozilla, accessed April 24, 2025,
[Link]
92. XQuery vs XPath | Learn Top 14 Comparisons with Infographics - EDUCBA,
accessed April 24, 2025, [Link]
