Parsing XML With Java
• What is a parser?
[Figure: a parser takes an input and a formal grammar and produces analyzed data]
XML Examples
world.xml
<?xml version="1.0"?>
<!DOCTYPE countries SYSTEM "world.dtd">
<countries>
  <country continent="&as;">
    <name>Israel</name>
    <population year="2001">6,199,008</population>
    <city capital="yes"><name>Jerusalem</name></city>
    <city><name>Ashdod</name></city>
  </country>
  <country continent="&eu;">
    <name>France</name>
    <population year="2004">60,424,213</population>
  </country>
</countries>
- <countries> is the root element
- world.dtd is the validating DTD file
- &as; is a reference to an entity
XML Tree Model
[Figure: the tree for world.xml. The root element countries has two country elements as children. Each country has a continent attribute (Asia, Europe) and name, population (with a year attribute) and city children; each city has a capital attribute (yes for Jerusalem, no for Ashdod) and a name child. Legend: element, attribute, simple content.]
world.dtd
<!ELEMENT countries (country*)>
<!ELEMENT country (name,population?,city*)>
<!ATTLIST country continent CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT city (name)>
Labels in the figure: parsed (#PCDATA), not parsed (CDATA), default
DOM Parser
<?xml version="1.0"?>
<!DOCTYPE countries SYSTEM "world.dtd">
<countries>
<country continent="&as;">
<name>Israel</name>
<population year="2001">6,199,008</population>
<city capital="yes"><name>Jerusalem</name></city>
<city><name>Ashdod</name></city>
</country>
<country continent="&eu;">
<name>France</name>
<population year="2004">60,424,213</population>
</country>
</countries>
The DOM Tree
[Figure: the DOM tree for world.xml. A Document node is the root; below it are countries, its two country elements with their continent attributes (Asia, Europe), and their name, population (year) and city (capital, name) descendants.]
Using a DOM Tree
[Figure: the DOM parser reads the XML file and builds a DOM tree in memory; the application works with the tree through the API]
Creating a DOM Tree
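The code for this slide is not preserved here, so the following is only a minimal sketch of one common way to build a DOM tree with the standard JAXP classes. The class name CreateDomTree and the inline XML string are assumptions made for illustration; a real program would more likely parse a file such as world.xml.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of the original slides.
public class CreateDomTree {
    // Parses XML text into an in-memory DOM tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<countries><country><name>Israel</name></country></countries>");
        // The root element of the tree is reached through the Document node.
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

The factory is asked for a builder rather than a concrete parser class being named, so the parser implementation can be swapped without touching this code.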
Run FragmentVsElement with 1st argument fragment/element
Interfaces in the DOM Tree
[Figure: the interfaces in the DOM tree, among them Document, DocumentType, Entity, EntityReference, ProcessingInstruction, Comment, Text and NamedNodeMap]
Node Navigation
Node Navigation (cont)
[Figure: navigation methods relative to a node: getParentNode(), getPreviousSibling(), getNextSibling(), getFirstChild(), getLastChild(), getChildNodes()]
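The navigation methods can be tried on a small tree. This is a sketch only; the class name NavigateNodes, the helper childNames and the sample XML are assumptions for illustration. The sample has no whitespace between tags, so that every child is an element node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of the original slides.
public class NavigateNodes {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Walks the children left to right with getFirstChild()/getNextSibling().
    static String childNames(Node parent) {
        StringBuilder sb = new StringBuilder();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (sb.length() > 0) sb.append(",");
            sb.append(c.getNodeName());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<country><name>Israel</name>"
                + "<population>6,199,008</population>"
                + "<city><name>Jerusalem</name></city></country>");
        Node country = doc.getDocumentElement();
        Node last = country.getLastChild();                           // the city element
        System.out.println(childNames(country));
        System.out.println(last.getPreviousSibling().getNodeName());  // step backwards
        System.out.println(last.getParentNode().getNodeName());       // step upwards
        System.out.println(country.getChildNodes().getLength());
    }
}
```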
Node Properties
• Every node has
- a type
- a name
- a value
- attributes
• The roles of these properties differ according to the
node types
• Nodes of different types implement different interfaces
(that extend Node)
Remarks:
- Not a very OO approach… reminds one of simulating OO using unions in C
- A dubious design: very few nodes have both a significant name and a significant value
- Only Element nodes have attributes; some would say this should have been a property of the Element derived class and not of Node
Names, Values and Attributes
Interface             | nodeName                  | nodeValue              | attributes
Attr                  | name of attribute         | value of attribute     | null
CDATASection          | "#cdata-section"          | content of the section | null
Comment               | "#comment"                | content of the comment | null
Document              | "#document"               | null                   | null
DocumentFragment      | "#document-fragment"      | null                   | null
DocumentType          | doc-type name             | null                   | null
Element               | tag name                  | null                   | NamedNodeMap
Entity                | entity name               | null                   | null
EntityReference       | name of entity referenced | null                   | null
Notation              | notation name             | null                   | null
ProcessingInstruction | target                    | entire content         | null
Text                  | "#text"                   | content of the text node | null
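A few rows of the table can be checked directly. The sketch below assumes a small made-up document; the class name NodeProperties and the helper describe are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of the original slides.
public class NodeProperties {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Shows the name and value properties that every node carries.
    static String describe(Node n) {
        return n.getNodeName() + " = " + n.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<city capital=\"yes\"><!--note-->Jerusalem</city>");
        Element city = doc.getDocumentElement();
        System.out.println(describe(doc));                               // Document row
        System.out.println(describe(city));                              // Element row
        System.out.println(describe(city.getAttributeNode("capital")));  // Attr row
        System.out.println(describe(city.getFirstChild()));              // Comment row
        System.out.println(describe(city.getLastChild()));               // Text row
        // Only Element nodes return a non-null attribute map.
        System.out.println(doc.getAttributes() == null);
        System.out.println(city.getAttributes().getLength());
    }
}
```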
Node Types - getNodeType()
ELEMENT_NODE = 1 PROCESSING_INSTRUCTION_NODE = 7
ATTRIBUTE_NODE = 2 COMMENT_NODE = 8
TEXT_NODE = 3 DOCUMENT_NODE = 9
CDATA_SECTION_NODE = 4 DOCUMENT_TYPE_NODE = 10
ENTITY_REFERENCE_NODE = 5 DOCUMENT_FRAGMENT_NODE = 11
ENTITY_NODE = 6 NOTATION_NODE = 12
if (myNode.getNodeType() == Node.ELEMENT_NODE) {
//process node
…
}
Read more about Node Interface
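import org.w3c.dom.*;
import javax.xml.parsers.*;

public class EchoWithDom {
  // Assumption: the class header, main() and the first lines of echo()
  // are missing from this text and are reconstructed here.
  public static void main(String[] args) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    Document doc = factory.newDocumentBuilder().parse("world.xml");
    new EchoWithDom().echo(doc);
  }

  // Prints a node, then recursively prints its children one level deeper.
  private void echo(Node n) {
    print(n);
    depth++;
    // Attribute nodes are not included among the children
    for (Node child = n.getFirstChild(); child != null;
         child = child.getNextSibling()) echo(child);
    depth--;
  }

  private int depth = 0;
  // Indexed by the value returned from getNodeType() (1..12)
  private String[] NODE_TYPES = {
    "", "ELEMENT", "ATTRIBUTE", "TEXT", "CDATA",
    "ENTITY_REF", "ENTITY", "PROCESSING_INST",
    "COMMENT", "DOCUMENT", "DOCUMENT_TYPE",
    "DOCUMENT_FRAG", "NOTATION" };

  // Prints one line for the node, indented by the current depth.
  private void print(Node n) {
    for (int i = 0; i < depth; i++) System.out.print(" ");
    System.out.print(NODE_TYPES[n.getNodeType()] + ":");
    System.out.print("Name: "+ n.getNodeName());
    System.out.print(" Value: "+ n.getNodeValue()+"\n");
  }
}
Run EchoWithDom and pay attention to the default values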
Another Example
public class WorldParser {
Node Manipulation
• Children of a node in a DOM tree can be manipulated -
added, edited, deleted, moved, copied, etc.
• To construct new nodes, use the methods of
Document
- createElement, createAttribute, createTextNode,
createCDATASection etc.
• To manipulate a node, use the methods of Node
- appendChild, insertBefore, removeChild, replaceChild,
setNodeValue, cloneNode(boolean deep) etc.
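The methods listed above can be combined as in the following sketch. The class name ManipulateNodes, the helpers build and childNames, and the element contents are assumptions for illustration; the comments track the children of country after each step.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical example class, not part of the original slides.
public class ManipulateNodes {
    // Builds a small tree from scratch, edits it, and returns the country element.
    static Element build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element country = doc.createElement("country");
        doc.appendChild(country);

        Element city = doc.createElement("city");
        city.appendChild(doc.createTextNode("Paris"));
        country.appendChild(city);                    // city

        // insertBefore(newChild, refChild): name goes in front of city
        Element name = doc.createElement("name");
        name.appendChild(doc.createTextNode("France"));
        country.insertBefore(name, city);             // name, city

        // cloneNode(true) copies the whole subtree, cloneNode(false) only the node
        Node deepCopy = city.cloneNode(true);
        Node shallowCopy = city.cloneNode(false);
        country.appendChild(deepCopy);                // name, city, city
        country.appendChild(shallowCopy);             // name, city, city, city

        // replaceChild(newChild, oldChild) and removeChild(oldChild)
        Element population = doc.createElement("population");
        country.replaceChild(population, deepCopy);   // name, city, population, city
        country.removeChild(shallowCopy);             // name, city, population
        return country;
    }

    static String childNames(Node parent) {
        StringBuilder sb = new StringBuilder();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (sb.length() > 0) sb.append(",");
            sb.append(c.getNodeName());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childNames(build()));
    }
}
```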
Figure as appears in “The XML Companion” - Neil Bradley
[Figure: insertBefore (the New node is inserted before the Ref node), replaceChild (the Old node is replaced by the New node), cloneNode with deep = 'false']
SAX – Simple API for XML
SAX Parser
• SAX = Simple API for XML
• XML is read sequentially
• When a parsing event happens, the parser
invokes the corresponding method of the
corresponding handler.
• This is called event-driven programming. Most
GUI programs are written using this paradigm.
• The handlers are the programmer’s implementations
of standard Java API interfaces and classes
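To make the event-driven style concrete, here is a minimal sketch of a handler that records which callbacks the parser invokes, in order. The class name SaxEvents and the sample XML are assumptions for illustration, and the parser is obtained through SAXParserFactory, which is one of several standard ways to get one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the original slides.
public class SaxEvents {
    // Records the name of each callback the parser invokes, in order.
    static List<String> events(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                log.add("startElement " + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("endElement " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                log.add("characters");
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        for (String event : events("<country><name>Israel</name></country>")) {
            System.out.println(event);
        }
    }
}
```

The program never asks the parser for data; it only reacts to the calls the parser makes while reading the input from start to end.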
<?xml version="1.0"?>
<!DOCTYPE countries SYSTEM "world.dtd">
<countries>
<country continent="&as;"> <!--israel-->
<name>Israel</name>
<population year="2001">6,199,008</population>
<city capital="yes"><name>Jerusalem</name></city>
<city><name>Ashdod</name></city>
</country>
<country continent="&eu;">
<name>France</name>
<population year="2004">60,424,213</population>
</country>
</countries>
[Figure sequence: the same document is shown repeatedly, each time highlighting the next event the parser reports, in this order: Start Document, Start Element, Start Element, Comment, Start Element, Characters, End Element, End Element, End Document]
SAX Parsers
[Figure: the XML document is fed to the SAX parser, which has been told: when you see the start of the document do …; when you see the start of an element do …; when you see the end of an element do …]
[Figure: the components of a SAX parser]
- XML-Reader Factory: used to create a SAX parser
- XML Reader: reads the XML and calls the handlers
- Content Handler: handles document events: start tag, end tag, etc.
- Error Handler: handles parser errors
- DTD Handler: handles DTD
- Entity Resolver: handles entities
Creating a Parser
• The SAX interface is an accepted standard
• There are many implementations by many
vendors
- The standard API does not include an actual
implementation, but Sun provides one with the JDK
• We would like to be able to change the
implementation used, without changing any code
in the program
- How is this done?
Factory Design Pattern
• Have a “factory” class that creates the actual parsers
- org.xml.sax.helpers.XMLReaderFactory
• The factory checks the configuration, mainly the value of a
system property, that specifies the implementation
- Can be set outside the Java code: a configuration file, a
command-line argument, etc.
• In order to change the implementation, simply change
the system property
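A short sketch of the factory in use follows. The class name WhichParser is an assumption for illustration. The system property consulted by XMLReaderFactory is org.xml.sax.driver; note that XMLReaderFactory is deprecated in recent JDKs (since Java 9), where SAXParserFactory is the usual replacement, so treat this as an illustration of the pattern rather than a recommendation.

```java
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class, not part of the original slides.
public class WhichParser {
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated since Java 9
    static XMLReader create() throws Exception {
        // The factory looks at the org.xml.sax.driver system property first;
        // when it is not set, a default implementation is chosen.
        return XMLReaderFactory.createXMLReader();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("org.xml.sax.driver = "
                + System.getProperty("org.xml.sax.driver"));
        System.out.println("implementation: " + create().getClass().getName());
    }
}
```

Running it as java -Dorg.xml.sax.driver=some.vendor.Parser WhichParser (the class name some.vendor.Parser is a placeholder) would select a different implementation without any change to the code.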
ContentHandler Methods
• startDocument - parsing begins
• endDocument - parsing ends
• startElement - an opening tag is encountered
• endElement - a closing tag is encountered
• characters - text (character data) is encountered
• ignorableWhitespace - whitespace that should
be ignored (according to the DTD)
• and more ...
Read more about ContentHandler Interface
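As an illustration of how startElement, characters and endElement work together, the sketch below collects the text of every name element. The class name NameCollector and the sample XML are assumptions; the handler extends DefaultHandler so that only the needed methods are overridden. Text is appended to a buffer because a parser is allowed to deliver one run of text in several characters calls.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the original slides.
public class NameCollector extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private StringBuilder text;   // non-null only while inside a name element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.equals("name")) text = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equals("name")) {
            names.add(text.toString());
            text = null;
        }
    }

    static List<String> collect(String xml) throws Exception {
        NameCollector handler = new NameCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect("<countries><country><name>Israel</name>"
                + "<city><name>Jerusalem</name></city></country></countries>"));
    }
}
```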
The Default Handler
int depth = 0;
Fixing The Parser
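import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class EchoWithSax {
  public static void main(String[] args) throws Exception {
    // Ask the factory for whichever parser implementation is configured
    XMLReader reader =
      XMLReaderFactory.createXMLReader();
    // EchoHandler is the ContentHandler implementation (not shown here)
    reader.setContentHandler(new EchoHandler());
    reader.parse("world.xml");
  }
}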
run EchoWithSax3
Attributes Interface
• The Attributes interface provides access to all the
attributes of an element
- getLength() (the number of attributes), getQName(i),
getValue(i), getType(i), getValue(qname), etc.
• The following are possible types for attributes:
CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS,
ENTITY, ENTITIES, NOTATION
• There is no distinction between attributes that are
specified explicitly in the document and those that are
defined in the DTD (with a default value)
Read more about Attributes Interface
Run EchoWithSax and check the “capital” attribute; compare to the XML source
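The point about default values can be seen in a small sketch. The class name AttributeEcho is hypothetical, and the inline DTD is an assumption: it declares a capital attribute with the default value "no", which is consistent with the tree shown earlier but is not taken from world.dtd itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the original slides.
public class AttributeEcho {
    // Assumed DTD: capital defaults to "no" when it is not written in the document.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE country [\n"
        + "  <!ELEMENT country (city*)>\n"
        + "  <!ELEMENT city (#PCDATA)>\n"
        + "  <!ATTLIST city capital (yes|no) \"no\">\n"
        + "]>\n"
        + "<country><city capital=\"yes\">Jerusalem</city><city>Ashdod</city></country>";

    // Lists every attribute the parser reports, as element/attribute=value.
    static List<String> attributes(String xml) throws Exception {
        final List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                for (int i = 0; i < atts.getLength(); i++) {
                    out.add(qName + "/" + atts.getQName(i) + "=" + atts.getValue(i));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // The second city has no capital attribute in the text, yet one is reported.
        System.out.println(attributes(XML));
    }
}
```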
ErrorHandler Interface
• We implement ErrorHandler to receive error events
(similar to implementing ContentHandler)
• DefaultHandler implements ErrorHandler with default
methods (warning and error do nothing, fatalError throws
the exception), so we can extend it (as before)
• An ErrorHandler is registered with
- reader.setErrorHandler(handler);
• Three methods:
- void error(SAXParseException ex);
- void fatalError(SAXParseException ex);
- void warning(SAXParseException ex);
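The registration step can be sketched as follows. The class name CountingErrorHandler, the helper validate and the tiny inline DTD are assumptions for illustration; validation is switched on through SAXParserFactory.setValidating so that validity violations are reported to error.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical example class, not part of the original slides.
public class CountingErrorHandler implements ErrorHandler {
    int warnings, errors, fatals;

    @Override public void warning(SAXParseException ex) { warnings++; }
    @Override public void error(SAXParseException ex) { errors++; }
    @Override public void fatalError(SAXParseException ex) throws SAXParseException {
        fatals++;
        throw ex;   // give up: the rest of the document cannot be trusted
    }

    // Parses with DTD validation on and returns the handler with its counts.
    static CountingErrorHandler validate(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        CountingErrorHandler handler = new CountingErrorHandler();
        reader.setErrorHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String dtd = "<!DOCTYPE country [<!ELEMENT country (name)>"
                   + "<!ELEMENT name (#PCDATA)>]>";
        // Valid: country contains the required name element.
        System.out.println(validate(dtd + "<country><name>France</name></country>").errors);
        // Invalid but well formed: the required name element is missing.
        System.out.println(validate(dtd + "<country/>").errors > 0);
    }
}
```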
Parsing Errors
• Fatal errors prevent the parser from continuing to parse
- For example, the document is not well formed, an unknown
XML version is declared, etc.
• Errors (that is, recoverable ones) occur, for example,
when the parser is validating and validity constraints are
violated
• Warnings occur when abnormal (yet legal) conditions
are encountered
- For example, an entity is declared twice in the DTD
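A fatal error can be provoked with a document that is not well formed. In this sketch the class name FatalErrorDemo and the helper tryParse are hypothetical; the handler is a plain DefaultHandler, whose fatalError throws, so the parse call ends with a SAXParseException.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the original slides.
public class FatalErrorDemo {
    // Returns a short description of how parsing ended.
    static String tryParse(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return "ok";
        } catch (SAXParseException ex) {
            // The exception carries the position at which the parser gave up.
            return "fatal error at line " + ex.getLineNumber();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tryParse("<country><name>France</name></country>"));
        // Not well formed: the name element is never closed.
        System.out.println(tryParse("<country>\n<name>France</country>"));
    }
}
```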
• Properties:
- xml-string - the actual text that caused the
current event (read-only with getProperty())
- lexical-handler - see the next slide...
Read more about Properties
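As a sketch of the lexical-handler property, the code below registers a handler that receives comments, which ContentHandler never reports. The class name CommentCollector is hypothetical; the property is set by its full SAX2 name, http://xml.org/sax/properties/lexical-handler, and DefaultHandler2 from org.xml.sax.ext supplies empty LexicalHandler methods to override.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical example class, not part of the original slides.
public class CommentCollector {
    // Returns the text of every comment in the document.
    static List<String> comments(String xml) throws Exception {
        final List<String> found = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                found.add(new String(ch, start, length));
            }
        };
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Lexical events are delivered only to a handler registered as a property.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return found;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(comments(
            "<country><!--israel--><name>Israel</name></country>"));
    }
}
```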
Lexical Events
LexicalHandler Methods
Parser Efficiency
• The DOM object built by DOM parsers is usually
complicated (composed of many, many objects) and
requires more memory storage than the XML file itself.
- A lot of time is spent on construction before use.
- For some very large documents, this may be impractical.
• SAX parsers store only local information that is
encountered during the serial traversal and do not create
many temporary objects during traversal.
• Hence, programming with SAX parsers is, in general,
more efficient.
Programming using SAX is Difficult
• In some cases, programming with SAX might
seem difficult at first glance:
- How can we find, using a SAX parser, elements e1
with ancestor e2?
- How can we find, using a SAX parser, elements e1
that have a descendant element e2?
- How can we find the element e1 referenced by the
IDREF attribute of e2?
• In other cases, using SAX can be more elegant.
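The first question above has a fairly short answer once the handler keeps its own state. The sketch below is one possible approach, not the only one: a stack of the currently open elements is maintained, and an e1 start tag counts when e2 is somewhere on the stack. The class name AncestorFinder and the sample XML are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the original slides.
public class AncestorFinder extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();  // currently open elements
    private final String target;
    private final String ancestor;
    int matches = 0;

    AncestorFinder(String target, String ancestor) {
        this.target = target;
        this.ancestor = ancestor;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Everything on the stack is an ancestor of the element that starts now.
        if (qName.equals(target) && open.contains(ancestor)) matches++;
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        open.pop();
    }

    // Counts the target elements that have the given ancestor.
    static int count(String xml, String target, String ancestor) throws Exception {
        AncestorFinder finder = new AncestorFinder(target, ancestor);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), finder);
        return finder.matches;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<countries><country><name>Israel</name>"
                   + "<city><name>Jerusalem</name></city></country></countries>";
        System.out.println(count(xml, "name", "city"));     // names inside a city
        System.out.println(count(xml, "name", "country"));  // names inside a country
    }
}
```

The descendant and IDREF questions need more bookkeeping, since the answer for an element is known only after later parts of the document have been read.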
Node Navigation
• SAX parsers do not provide access to elements other
than the one currently visited in the serial (DFS)
traversal of the document
• In particular,
- They do not read backwards
- They do not enable access to elements by ID or name
• DOM parsers enable any traversal method
• Hence, using DOM parsers can be more comfortable
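For contrast with SAX, here is a minimal DOM sketch that reaches elements by name and visits them in reverse document order. The class name FindByName and the sample XML are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of the original slides.
public class FindByName {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Collects the text of every city element, last one first.
    static List<String> cityNamesBackwards(Document doc) {
        List<String> names = new ArrayList<>();
        NodeList cities = doc.getElementsByTagName("city");  // access by name
        for (int i = cities.getLength() - 1; i >= 0; i--) {  // any order we like
            names.add(cities.item(i).getTextContent());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<country><name>Israel</name>"
                + "<city><name>Jerusalem</name></city>"
                + "<city><name>Ashdod</name></city></country>");
        System.out.println(cityNamesBackwards(doc));
    }
}
```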
More DOM Advantages
Which should we use?
DOM vs. SAX
• If your document is very large and you only need
a few elements – use SAX
• If you need to manipulate (i.e., change) the XML
– use DOM
• If you need to access the XML many times – use
DOM (assuming the file is not too large)
• Depending on the task at hand, it might be easier
for you to visualize it implemented using one of
the APIs and not the other – so use it!