XML Parsers
Overview
Types of parsers
Using XML parsers
SAX
DOM
DOM versus SAX
Products
Conclusion
Types of Parsers
There are several different ways to categorise
parsers:
Validating versus non-validating parsers
Parsers that support the Document Object Model
(DOM)
Parsers that support the Simple API for XML (SAX)
Parsers written in a particular language (Java, C++,
Perl, etc.)
Non-validating Parsers
Speed and efficiency
- It takes a significant amount of effort for an XML
parser to process a DTD and make sure that
every element in an XML document follows the
rules of the DTD.
If you only want to find tags and extract
information, use a non-validating parser
Using XML Parsers
Three basic steps to use an XML parser
Create a parser object
Pass your XML document to the parser
Process the results
Generally, writing out XML is outside the scope of
parsers (though some may implement proprietary
mechanisms)
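The three basic steps can be sketched as follows. This is a minimal illustration, assuming Python's standard-library `xml.etree.ElementTree.XMLParser`, chosen only because it exposes each step explicitly; it is neither SAX nor DOM, and the sample catalog document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = (
    "<catalog>"
    "<book id='b1'>XML Basics</book>"
    "<book id='b2'>SAX and DOM</book>"
    "</catalog>"
)


def parse_titles(xml_text):
    # Step 1: create a parser object.
    parser = ET.XMLParser()
    # Step 2: pass the XML document to the parser.
    parser.feed(xml_text)
    # close() finishes parsing and returns the root element.
    root = parser.close()
    # Step 3: process the results.
    return [book.text for book in root.iter("book")]


titles = parse_titles(SAMPLE)
print(titles)
```

Other parsers hide step 1 behind a convenience function, but the same three stages are still there.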
Parsing XML
Two established APIs:
SAX (Simple API for XML)
Defines handlers whose methods are called as the XML
is parsed
DOM (Document Object Model)
Defines a logical tree representing the parsed
XML
Parsing XML: DOM
Document Object Model
standard API for accessing and creating XML data
tree-based
programming language independent
developed by W3C
whole document is read into memory
read and write
Creating a DOM Tree
A DOM implementation provides a method that passes an
XML file to a factory object, which returns a
Document object representing the whole document
(the root element is reached through it)
After this, you may use the standard DOM interfaces to
interact with the XML structure
[Figure: the Application accesses the DOM tree through the API]
Parsing XML: DOM
[Figure: an XML file and the corresponding DOM tree]
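As a rough sketch of creating and then using a DOM tree, the example below assumes Python's standard-library `xml.dom.minidom` as the DOM implementation; the catalog document and the variable names are hypothetical.

```python
from xml.dom import minidom

# Hypothetical sample document, used only for illustration.
SAMPLE = "<catalog><book id='b1'><title>XML Basics</title></book></catalog>"

# The implementation parses the input and returns a Document object.
document = minidom.parseString(SAMPLE)

# The root element is reached through the Document.
root = document.documentElement

# From here on, only standard DOM calls are used.
first_book = root.getElementsByTagName("book")[0]
title = first_book.getElementsByTagName("title")[0]

root_name = root.tagName
book_id = first_book.getAttribute("id")
title_text = title.firstChild.data

print(root_name, book_id, title_text)
```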
DOM Interfaces
The DOM defines several interfaces
Node:     The base data type of the DOM
Element:  Represents an element
Attr:     Represents an attribute of an element
Text:     The content of an element or attribute
Document: Represents the entire XML document.
          A Document object is often referred to
          as a DOM tree
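The interfaces listed above can be seen on a one-element document. This is a small sketch assuming Python's `xml.dom.minidom`; the `kind` helper is hypothetical and simply maps the standard `nodeType` constants to interface names.

```python
from xml.dom import Node, minidom

document = minidom.parseString("<note lang='en'>hello</note>")
element = document.documentElement        # Element
attr = element.getAttributeNode("lang")   # Attr
text = element.firstChild                 # Text

# Hypothetical helper: map DOM nodeType constants to interface names.
NAMES = {
    Node.DOCUMENT_NODE: "Document",
    Node.ELEMENT_NODE: "Element",
    Node.ATTRIBUTE_NODE: "Attr",
    Node.TEXT_NODE: "Text",
}


def kind(node):
    return NAMES.get(node.nodeType, "other")


kinds = [kind(n) for n in (document, element, attr, text)]
print(kinds)
print(attr.value, text.data)
```

Every object here is also a Node, which is why the same `nodeType` attribute is available on all four.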
DOM Levels
DOM Level 1
- basic functionality for document navigation and
manipulation.
DOM Level 2
- includes a style sheet object model
- defines an event model and provides support for
XML namespaces.
DOM Level 3
- W3C Recommendation since 2004 (Core, Load and Save, Validation)
- addresses document loading and saving
- adds document validation support (DTDs and schemas)
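The namespace support introduced with DOM Level 2 can be illustrated with the namespace-aware lookup method. This sketch assumes Python's `xml.dom.minidom`; the namespace URI `urn:example:catalog` and the document are made up for the example.

```python
from xml.dom import minidom

# Hypothetical namespace URI and document, used only for illustration.
CATALOG_NS = "urn:example:catalog"
SAMPLE = (
    '<c:catalog xmlns:c="urn:example:catalog">'
    "<c:book/>"
    "<book/>"
    "</c:catalog>"
)

document = minidom.parseString(SAMPLE)

# Namespace-aware lookup: matches on namespace URI plus local name.
ns_books = document.getElementsByTagNameNS(CATALOG_NS, "book")

# Plain lookup: matches on the literal tag name only.
plain_books = document.getElementsByTagName("book")

print(len(ns_books), ns_books[0].tagName)
print(len(plain_books), plain_books[0].tagName)
```

The two lookups return different elements even though both are called "book" in the source text, which is the situation namespaces exist to disambiguate.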
Parsing XML: SAX
Simple API for XML
API for accessing XML data
event based
programming language independent
application must store any fragments it needs in memory
read only
Parsing XML: SAX
SAX is an interface to the XML parser based on
streaming and call-backs
You extend the HandlerBase class (SAX 1; replaced by DefaultHandler in SAX 2) and override its callbacks:
startDocument, endDocument
startElement, endElement
characters
warning, error, fatalError
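A handler built from these callbacks might look like the sketch below. It assumes Python's standard-library `xml.sax`, where the base class is `ContentHandler` rather than HandlerBase; the `TitleHandler` class and the sample document are hypothetical. Note how the application itself has to buffer the text it wants to keep.

```python
import xml.sax

# Hypothetical sample document, used only for illustration.
SAMPLE = (
    b"<catalog>"
    b"<book><title>XML Basics</title></book>"
    b"<book><title>SAX and DOM</title></book>"
    b"</catalog>"
)


class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


handler = TitleHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.titles)
```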
Parsing XML: SAX
[Figure: an XML file and the sequence of SAX calls it triggers]
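The mapping from an XML file to SAX calls can be made visible by logging every callback. This is a sketch assuming Python's `xml.sax`; the `CallLogger` class and the tiny input document are hypothetical.

```python
import xml.sax


class CallLogger(xml.sax.ContentHandler):
    """Records each callback in the order the parser makes it."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def startDocument(self):
        self.calls.append("startDocument")

    def endDocument(self):
        self.calls.append("endDocument")

    def startElement(self, name, attrs):
        self.calls.append("startElement " + name)

    def endElement(self, name):
        self.calls.append("endElement " + name)

    def characters(self, content):
        self.calls.append("characters " + content)


logger = CallLogger()
xml.sax.parseString(b"<a><b>hi</b></a>", logger)

for call in logger.calls:
    print(call)
```

The calls arrive in document order, and the parser keeps nothing once a callback has returned.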
SAX versus DOM
DOM:
read and write
need to move back and forth in data
document is human created
SAX:
read only
huge data or streams
data is machine generated
DOM pro and contra
PRO
The file is parsed only once.
Strong navigation abilities: this is the aim of the DOM design.
CONTRA
More memory needed since the XML tree is in memory.
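The navigation abilities mentioned above can be sketched as follows, assuming Python's `xml.dom.minidom`. The document is hypothetical and deliberately contains no whitespace between elements, so every sibling is an element node.

```python
from xml.dom import minidom

# Hypothetical sample document without inter-element whitespace.
SAMPLE = (
    "<catalog>"
    "<book><title>A</title><price>10</price></book>"
    "<book><title>B</title></book>"
    "</catalog>"
)

# The file is parsed once; everything below walks the in-memory tree.
document = minidom.parseString(SAMPLE)
books = document.getElementsByTagName("book")

second = books[1]
first = second.previousSibling      # move backwards
parent = second.parentNode          # move upwards
child_names = [node.tagName for node in first.childNodes]  # move downwards

print(parent.tagName, child_names)
```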
SAX pro and contra
PRO
Low memory needs since the XML file is never entirely in
memory
Can deal with XML streams
CONTRA
To reach any node, the file has to be parsed again from the
start. Thus, fetching the 10 nodes of a catalog one request at a
time ends up parsing the same file 10 times.
Poor navigation abilities: no easy way to get the children of a
given node or the list of "B" nodes
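The low memory use and stream handling listed under PRO can be illustrated by feeding the parser one chunk at a time. This is a sketch assuming the default parser returned by Python's `xml.sax.make_parser()`, which supports incremental `feed` and `close`; `ElementCounter`, `count_items` and `generate_chunks` are hypothetical names.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts <item> elements without storing them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1


def count_items(chunks):
    parser = xml.sax.make_parser()
    handler = ElementCounter()
    parser.setContentHandler(handler)
    # Only the current chunk is held in memory at any time.
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return handler.count


def generate_chunks(n):
    # Stands in for a network stream or a very large file.
    yield b"<items>"
    for i in range(n):
        yield b"<item>%d</item>" % i
    yield b"</items>"


total = count_items(generate_chunks(1000))
print(total)
```

The same loop would work unchanged for a document far larger than available memory, because the handler keeps only a counter.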
SAX versus DOM
If your document is very large and you only need a
few elements - use SAX
If you need to process many elements and perform
operations on XML - use DOM
If you need to access the XML many times
- use DOM
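When operations on the XML are needed, the read-and-write nature of DOM is what makes the difference. The sketch below assumes Python's `xml.dom.minidom`; the document and the added attribute and element are hypothetical.

```python
from xml.dom import minidom

# Hypothetical sample document, used only for illustration.
SAMPLE = "<catalog><book id='b1'><title>XML Basics</title></book></catalog>"

document = minidom.parseString(SAMPLE)

# Modify an existing element in place.
book = document.getElementsByTagName("book")[0]
book.setAttribute("status", "reviewed")

# Build a new subtree and attach it to the root element.
new_book = document.createElement("book")
new_book.setAttribute("id", "b2")
title = document.createElement("title")
title.appendChild(document.createTextNode("SAX and DOM"))
new_book.appendChild(title)
document.documentElement.appendChild(new_book)

# Serialise the modified tree back to text.
updated_xml = document.documentElement.toxml()
book_count = len(document.getElementsByTagName("book"))

print(updated_xml)
```

With SAX, the same change would require the application to re-emit the whole document itself while handling events.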
Parser Products
Xerces4J / Xerces4C++ (Apache)
James Clark's XP (Java)
IBM XML4J / XML4C++
Java Project X (Sun)
Oracle's XML Parser for Java
MSXML (Microsoft)
Dan Connolly's XML Parser (Python)
Conclusion
The parser is a key building block for every XML
application.
When building XML applications, you have to think
about how you will handle large chunks of data
Choosing between SAX and DOM is not always trivial
The End
Questions?
Thank you!