Introduction To XML Extensible Markup Language: Carol Wolf Computer Science Department
What is XML
XML stands for eXtensible Markup Language. A markup language is used to provide information about a document. Tags are added to the document to provide the extra information. HTML tags tell a browser how to display the document. XML tags give a reader some idea what some of the data means.
Advantages of XML
XML is text (Unicode) based.
Takes up less space. Can be transmitted efficiently.
XML Rules
Tags are enclosed in angle brackets. Tags come in pairs with start-tags and end-tags. Tags must be properly nested.
<name><email></name></email> is not allowed. <name><email></email></name> is.
Tag names may not begin with the letters xml in any combination of cases, since such names are reserved. Tag names may not contain < or &. They follow Java naming conventions, except that a colon and some other characters, such as hyphens and periods, are allowed. They must begin with a letter or underscore and may not contain white space. A document must have a single root element that encloses all the others.
Encoding
XML (like Java) uses Unicode to encode characters. Unicode comes in several encodings. The most common one used in the West is UTF-8. UTF-8 is a variable-length code: each character is encoded in 1 to 4 bytes. The first 128 characters in Unicode are ASCII and take one byte each. The code points between 128 and 255 cover some of the more common characters used in western Europe, such as accented letters like é or ü; in UTF-8 these take two bytes. Two-byte codes also cover most other alphabetic scripts, such as Greek, Cyrillic, Hebrew, and Arabic. Three-byte codes handle most Asian ideographs, and four-byte codes handle the characters that are left. Those using non-western languages should investigate other encodings of Unicode, such as UTF-16.
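The variable-length behavior of UTF-8 can be checked directly in Java. The following is a minimal sketch; the class name Utf8Lengths and the four sample characters are chosen here for illustration and are not part of the original slides.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Escapes are used so that this source file stays pure ASCII.
        System.out.println(utf8Length("A"));            // U+0041, ASCII: prints 1
        System.out.println(utf8Length("\u00E9"));       // U+00E9, e with acute accent: prints 2
        System.out.println(utf8Length("\u4E2D"));       // U+4E2D, a CJK ideograph: prints 3
        System.out.println(utf8Length("\uD834\uDD1E")); // U+1D11E, musical G clef: prints 4
    }
}
```

The last sample lies outside the first 65,536 code points, so Java stores it as a surrogate pair, and UTF-8 encodes it in four bytes.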
Well-Formed Documents
An XML document is said to be well-formed if it follows all the rules. An XML parser is used to check that all the rules have been obeyed. Recent browsers such as Internet Explorer 5 and Netscape 7 come with XML parsers. Parsers are also available for free download over the Internet. One is Xerces, from the Apache open-source project. Java 1.4 also supports an open-source parser.
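As a rough illustration of using a parser to check well-formedness, the sketch below relies on the JAXP parser bundled with the JDK. The class and method names (WellFormedCheck, isWellFormed) are hypothetical; the sample strings are the nesting examples from the rules, plus a document with two roots.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the parser accepts the text, false if it reports a parse error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness rule was broken
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem, not an XML error
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<name><email></email></name>")); // properly nested
        System.out.println(isWellFormed("<name><email></name></email>")); // tags overlap
        System.out.println(isWellFormed("<a/><b/>"));                     // two root elements
    }
}
```

Only the first string should be accepted; the other two break the nesting rule and the single-root rule respectively.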
Markup for the data aids understanding of its purpose. A flat text file is not nearly so clear.
Alice Lee
[email protected]
212-346-1234
1985-03-22

The last line looks like a date, but what is it for?
Expanded Example
<?xml version="1.0"?>
<address>
    <name>
        <first>Alice</first>
        <last>Lee</last>
    </name>
    <email>[email protected]</email>
    <phone>123-45-6789</phone>
    <birthday>
        <year>1983</year>
        <month>07</month>
        <day>15</day>
    </birthday>
</address>
[Figure labels: first, last, year, month, day]
XML Trees
An XML document has a single root node. The tree is a general ordered tree.
A parent node may have any number of children. Child nodes are ordered, and may have siblings.
Preorder traversals are usually used for getting information out of the tree.
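A preorder traversal visits a node before any of its children. The sketch below shows one way to do this over a DOM tree in Java; the class name PreorderWalk and the abbreviated address document (email and phone left out) are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PreorderWalk {
    // Parses the text and returns the element names in preorder.
    static List<String> preorder(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            visit(doc.getDocumentElement(), names);
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Record the node first, then recurse into its children from left to right.
    private static void visit(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            visit(child, names);
        }
    }

    public static void main(String[] args) {
        String xml = "<address><name><first>Alice</first><last>Lee</last></name>"
                + "<birthday><year>1983</year><month>07</month><day>15</day></birthday></address>";
        System.out.println(preorder(xml));
    }
}
```

For this document the visit order is address, name, first, last, birthday, year, month, day, which matches the order in which the start-tags appear in the text.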
Validity
A well-formed document has a tree structure and obeys all the XML rules. A particular application may add more rules in either a DTD (document type definition) or a schema. Many specialized DTDs and schemas have been created to describe particular areas. These range from disseminating news bulletins (RSS) to chemical formulas. DTDs were developed first, so they are not as comprehensive as schemas.
A DTD determines how many times a node may appear, and how child nodes are ordered.
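To make the idea concrete, the sketch below validates a document against a small internal DTD using the JDK parser. The DTD itself is a hypothetical one written for this example (it is not from the slides): it requires exactly one name, allows any number of email elements and at most one phone, and fixes the order first, then last. The class name DtdCheck is likewise an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdCheck {
    // Hypothetical DTD for a cut-down address document.
    static final String DTD =
        "<!DOCTYPE address ["
        + "<!ELEMENT address (name, email*, phone?)>"  // one name, any emails, optional phone
        + "<!ELEMENT name (first, last)>"              // first must come before last
        + "<!ELEMENT first (#PCDATA)>"
        + "<!ELEMENT last (#PCDATA)>"
        + "<!ELEMENT email (#PCDATA)>"
        + "<!ELEMENT phone (#PCDATA)>"
        + "]>";

    // Returns true if the body is well-formed and valid against DTD.
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                // Validity problems arrive as recoverable errors; treat them as failures.
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<address><name><first>Alice</first><last>Lee</last></name></address>"));
        System.out.println(isValid("<address><name><last>Lee</last><first>Alice</first></name></address>"));
    }
}
```

Both sample documents are well-formed, but only the first is valid: the second puts last before first, which the DTD does not permit.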
Schemas
Schemas are themselves XML documents. They were standardized after DTDs and provide more information about the document. They have a number of data types including string, decimal, integer, boolean, date, and time. They divide elements into simple and complex types. They also determine the tree structure and how many children a node may have.
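The sketch below shows a small schema and a Java check against it, using the javax.xml.validation package in the JDK. The schema is a hypothetical one written for this example: it declares birthday as a complex type whose children year, month, and day are simple integer elements in that order. The class name SchemaCheck is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SchemaCheck {
    // Hypothetical schema: note that it is itself an XML document.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='birthday'>"
        + "<xs:complexType><xs:sequence>"
        + "<xs:element name='year' type='xs:integer'/>"
        + "<xs:element name='month' type='xs:integer'/>"
        + "<xs:element name='day' type='xs:integer'/>"
        + "</xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    // Compiles the schema once per call; a real program would cache it.
    private static Schema loadSchema() {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
    }

    // Returns true if the document is valid against XSD.
    static boolean isValid(String xml) {
        try {
            // With no error handler set, the validator throws on the first error.
            loadSchema().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<birthday><year>1983</year><month>07</month><day>15</day></birthday>"));
        System.out.println(isValid("<birthday><year>nineteen</year><month>07</month><day>15</day></birthday>"));
    }
}
```

The second document has the right tree structure but fails the data type check on year, which is something a DTD could not express.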
XSLT
Extensible Stylesheet Language Transformations
XSLT is used to transform one XML document into another document, often an HTML document. The transformation classes (the javax.xml.transform package) are part of Java as of version 1.4. A program takes one XML document and a stylesheet as input and produces another document as output. If the resulting document is in HTML, it can be viewed by a web browser. This is a good way to display XML data.
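A minimal sketch of such a program follows, using the javax.xml.transform classes in the JDK. The stylesheet is a hypothetical one written for this example; it turns an address document into a tiny HTML page showing the person's name. The class name XsltDemo is an assumption, and the exact whitespace of the output may vary between XSLT processors.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltDemo {
    // Hypothetical stylesheet: matches the root address element and emits HTML.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/address'>"
        + "<html><body><h1>"
        + "<xsl:value-of select='name/first'/>"
        + "<xsl:text> </xsl:text>"               // explicit space between the two names
        + "<xsl:value-of select='name/last'/>"
        + "</h1></body></html>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies STYLESHEET to the XML text and returns the result as a string.
    static String toHtml(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<address><name><first>Alice</first><last>Lee</last></name></address>";
        System.out.println(toHtml(xml));
    }
}
```

The stylesheet is itself an XML document, so it must also be well-formed before the transformer will accept it.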
Parsers
There are two principal models for parsers.
SAX: Simple API for XML
Uses a call-back method.
Similar to javax listeners.
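The call-back style can be sketched as follows with the SAX parser in the JDK. The parser reads the document from start to finish and calls the handler's methods as it meets each start-tag and end-tag, much as a listener is called when an event occurs. The class name SaxDemo and the choice to record events in a list are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxDemo {
    // Parses the text and returns a log of the start and end call-backs received.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start " + qName); // called when a start-tag is read
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);   // called when an end-tag is read
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(events("<name><first>Alice</first><last>Lee</last></name>"));
    }
}
```

No tree is built in memory here; the program sees each event once, in document order, and must keep any state it needs itself.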