Semistructured Data : Peter Buneman
Peter Buneman
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104-6389
[email protected]
In semistructured data, the information that is normally associated with a
schema is contained within the data, which is sometimes called
“self-describing”. In some forms of semistructured data there is no separate
schema, in others it exists but only places loose constraints on the data.
Semistructured data has recently emerged as an important topic of study for
a variety of reasons. First, there are data sources such as the Web, which
we would like to treat as databases but which cannot be constrained by a
schema. Second, it may be desirable to have an extremely flexible format for
data exchange between disparate databases. Third, even when dealing with
structured data, it may be helpful to view it as semistructured for the
purposes of browsing. This tutorial will cover a number of issues
surrounding such data: finding a concise formulation, building a
sufficiently expressive language for querying and transformation, and
optimization problems.

1 The motivation

The topic of semistructured data (also called unstructured data) is
relatively recent, and a tutorial on the topic may well be premature. It
represents, if anything, the convergence of a number of lines of thinking
about new ways to represent and query data that do not completely fit with
conventional data models. The purpose of this tutorial is to describe this
motivation and to suggest areas in which further research may be fruitful.
For a similar exposition, the reader is referred to Serge Abiteboul’s recent
survey paper [1].

The slides for this tutorial will be made available from a section of the
Penn database home page http://www.cis.upenn.edu/~db.

The most obvious motivation comes from the need to bring new forms of data
into the ambit of conventional database technology. Some of these, such as
documents with structured text [3, 2] and data formats [9, 17], while they
may call for increasingly expressive query languages and new optimization
techniques, only require mild extensions to the existing notion of data
models such as ODMG [13]. However, these extensions still require the prior
imposition of structure on the data, and there are some forms of data for
which this is genuinely difficult.

The most immediate example of data that cannot be constrained by a schema is
the World-Wide-Web. As database researchers we would like to think of this
as a database, but to what extent are database tools available for querying
or maintaining the web? Most web queries exploit information retrieval
techniques to retrieve individual pages from their contents, but there is
little available that allows us to use the structure of the web in
formulating queries, and since the web does not obviously conform to any
standard data model, we need a method of describing its structure.

Another example, little known to the database community but responsible for
piquing the author’s interest in this topic, is the database management
system ACeDB, which is popular with biologists [36]. Superficially it looks
like an object-oriented database system, for it has a schema language that
resembles that of an object-oriented DBMS; but this schema imposes only
loose constraints on the data. Moreover, the relationship between data and
schema is not easily described in object-oriented terms, and there are
structures that are naturally expressed in ACeDB, such as trees of arbitrary
depth, that cannot be queried using conventional techniques.
1.2 Data Integration
A second motivation is that of data exchange and transformation, which is
the starting point for the Tsimmis project [33, 21] at Stanford. The
rationale here is that none of the existing data models is all-embracing, so
that it is difficult to build software that will easily convert between two
disparate models. The Object Exchange Model (OEM) offers a highly flexible
data structure that may be used to capture most kinds of data and provides a
substrate in which almost any other data structure may be represented. In
effect, OEM is an internal data structure for exchange of data between
DBMSs, but having such a structure invites the idea of querying data in OEM
format directly.

*This work was partly supported by the Army Research Office
(DAAH04-95-1-0169) and the National Science Foundation (CCR92-16122).

Permission to make digital/hard copies of all or part of this material for
personal or classroom use is granted without fee provided that the copies
are not made or distributed for profit or commercial advantage, the
copyright notice, the title of the publication and its date appear, and
notice is given that copyright is by permission of the ACM, Inc. To copy
otherwise, to republish, to post on servers or to redistribute to lists,
requires specific permission and/or fee.
PODS '97, Tucson, Arizona, USA
Copyright 1997 ACM 0-89791-910-6/97/05 ..$3.50
Figure 1: An example movie database.
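
As a rough stand-in for the kind of data Figure 1 depicts, the following
Python sketch builds a small movie database in the edge-labeled form that
Section 2 writes as type tree = set(label x tree). The Symbol class, the
helper names tree, leaf and find, and the sample data are all assumptions
made for illustration; they are not the contents of the original figure, and
nested frozensets cannot represent the cycles the model allows.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterator, Tuple, Union


@dataclass(frozen=True)
class Symbol:
    """An edge name such as Movie or Title, kept distinct from string data."""
    name: str


# type label = int | string | ... | symbol
Label = Union[int, str, Symbol]
# type tree = set(label x tree); a frozenset keeps the edges unordered
Tree = FrozenSet[Tuple[Label, "Tree"]]

EMPTY: Tree = frozenset()


def tree(*edges: Tuple[Label, Tree]) -> Tree:
    """Build a tree from (label, subtree) pairs."""
    return frozenset(edges)


def leaf(value: Label) -> Tree:
    """Data is carried on an edge that leads to the empty tree."""
    return tree((value, EMPTY))


def find(t: Tree, wanted: Callable[[Label], bool],
         prefix: Tuple[Label, ...] = ()) -> Iterator[Tuple[Label, ...]]:
    """Yield the label path to every edge whose label satisfies `wanted`."""
    for label, child in t:
        path = prefix + (label,)
        if wanted(label):
            yield path
        yield from find(child, wanted, path)


# Hypothetical sample data, not the contents of Figure 1.
ENTRY, MOVIE, TITLE, CAST = (Symbol(n) for n in ("Entry", "Movie", "Title", "Cast"))
db: Tree = tree(
    (ENTRY, tree((MOVIE, tree(
        (TITLE, leaf("Casablanca")),
        (CAST, tree((1, leaf("Bogart")))),               # cast as an integer-indexed array
    )))),
    (ENTRY, tree((MOVIE, tree(
        (TITLE, leaf("Play it again, Sam")),
        (CAST, tree((Symbol("Actor"), leaf("Allen")))),  # cast as a nested record
    )))),
)

# Schema-free questions become predicates on labels.
where = list(find(db, lambda l: l == "Casablanca"))
big = list(find(db, lambda l: isinstance(l, int) and l > 2**16))
act = list(find(db, lambda l: isinstance(l, Symbol)
                and l.name.lower().startswith("act")))
```

Because the type of each label is inspected at run time, the three queries
need no knowledge of a schema, which is the sense in which such data is
called self-describing.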
1.3 Browsing

A final motivation is that of browsing. Generally speaking, a user cannot
write a database query without knowledge of the schema. However, schemas may
have opaque terminology and the rationale for the design is often difficult
to understand. It may help in understanding the schema to be able to query
data without full knowledge of the schema. For example, the queries:

• Where in the database is the string “Casablanca” to be found?

• Are there integers in the database greater than 2^16?

• What objects in the database have an attribute name that starts with
  “act”?

Such questions cannot be answered in any generic fashion by standard
relational or object-oriented query languages. While languages have been
proposed that allow schema and data to be queried simultaneously [24] in the
context of relational and object-oriented database systems, these languages
do not have the flexibility to express complex constraints on paths, and it
is not clear how their implementation will work on the structures described
below.

2 The Model

The unifying idea in semi-structured data is the representation of data as
some kind of graph-like or tree-like structure. Although we shall allow
cycles in the data, we shall generally refer to these graphs as trees. The
example in figure 1 is taken from [10] in which the data model is formalized
as an edge labeled graph. The structure is taken (with some inaccuracies)
from a well-known web database [23] that provides a good example of
semistructured data. There are several things to note about it. If one
confines one’s attention to the parts of the database below Movie edges, the
data appears fairly regular except that there are two ways of representing a
cast. That is, the data does not quite fit with some relational or
object-oriented presentation. Edges are labeled both with data, of types
such as int and string and possibly other base or external abstract types
(video, audio etc.). Edges are also labeled with names such as Movie and
Title that would normally be used for attribute or class names. We shall
refer to such labels as symbols. Internally they are represented as strings.
Note that arrays may be represented by labeling internal edges with
integers. We can formulate the type of this kind of labeled tree as:

    type label = int | string | ... | symbol
    type tree = set(label x tree)

The first line describes a tagged union or variant, the second says that a
tree is a set of label/tree pairs. The edges out of nodes in our trees are
assumed to be unordered.

There are a number of variations on this basic model, and it is worth
briefly reviewing them. In [5] leaf nodes are labeled with data, internal
nodes are not labeled with meaningful data, and edges are labeled only with
symbols:

    type base = int | string | ...
    type tree = base | set(symbol x tree)

The differences between the two models are minor and give rise to minor
differences in the query language. It is easy to define mappings in both
directions.

Another possibility is to allow labels on internal nodes, for example:

    type label = int | string | ... | symbol
    type tree = label x set(label x tree)

The problem with using this representation directly is that it makes the
operation of taking the union of two trees difficult to define. However, by
introducing extra edges, this representation can be converted into one of
the edge-labelled representations above.

A final and more complex issue is that of object identity, by which we mean
node labels - or possibly edge labels - that, apart from an equality test,
are not observable in the query language. In OEM, object identities are used
as node labels and place-holders to define trees. While object-identities
provide an efficient way to define and test equality
within a database, they pose problems when comparing data across databases.
See [10, 25, 32] for discussions and related work.

It is straightforward to encode relational and object-oriented databases in
this model, although in the latter case one must take care to deal with the
issue of object-identity. However, the coding is not unique, and the
examples in [10] and [5] show some differences in how tuples of sets are
treated.

The term “self describing” is often used to describe unstructured data. In
each of the models we have described, the data is a tagged union type, and
one can imagine a program whose behavior is dynamically determined by
“switching” on the type. For example, a program’s behavior may be altered by
whether it finds an integer or string as a label, and one would expect any
language for dealing with semistructured data to incorporate predicates that
describe the type of an edge or node. The situation is similar to that in
programming languages. Lisp and many interpreted and scripting languages are
dynamically typed. Predicates are available to determine (at run time) the
type of a value or class of an object. Languages in the Algol tradition
(Pascal, C, ML, Modula) are statically typed. Predicates are not needed to
determine the type of a value because it is known from the source code of
the program and hence to the programmer. There is a good analogy between
dynamic type systems and semistructured data on one hand, and static type
systems and databases with schemas on the other.

3 Query Languages

There appear to be two general approaches to devising query languages for
semistructured data. The first is to take SQL (or perhaps OQL [14, 13]) as a
starting point and add enough “features” to perform a useful class of
queries. The second approach is to start from a language based on some
formal notion of computation on semistructured data and then to massage that
language into acceptable syntax. It is remarkable that the two approaches
appear to end up with very similar languages.

Let us start with the first approach to see what kinds of queries are
useful. The following SQL-like syntax suggests itself:

    select Entry.Movie.Title
    from DB
    where Entry.Movie.Director ...

However, the syntax does not make clear how much of the two paths
Entry.Movie.Title and Entry.Movie.Director are to be taken as the same. The
solution is to introduce variables to indicate how paths or edges are to be
tied together. These variables can then be used in other expressions to form
new structures. Label variables, tree variables and possibly path variables
are needed to express a reasonable set of queries.

The next problem is that one wants to specify paths of arbitrary length to
find, for example, all the strings in the database. This requires us to be
able to express arbitrary paths in our syntax. Even this is not enough.
Consider the problem of finding whether “Allen” acted in “Casablanca”. One
might try this by searching for paths from a Movie edge down to an “Allen”
edge, but one would not want this path to contain another Movie edge. These
problems indicate that one would like to have something like regular
expressions to constrain paths.

The “select” fragment of UnQL [10] and the Lorel query language [5] solve
these problems with very similar syntactic forms. Lorel, which is a
component of the Lore project [27], requires a rich set of overloadings for
its operators for dealing with comparisons of objects with values and of
values with sets. These are avoided in UnQL by not having object identity
and exploiting a simple form of pattern matching. Other languages that use a
SQL-like syntax include a precursor to Lorel [34], and WebSQL [29, 7] which
contains a number of constructs specific to web queries. A language for web
site management is proposed in [18].

Having asked what the surface syntax should look like, one wants to ask what
the underlying computational strategy should be. Here there appear to be two
principled strategies. The first is to model the graph as a relational
database and then exploit a relational query language. In our labeled graph
model this is remarkably simple. We can take the database as a large
relation of type (node-id, label, node-id) and consider the expressive power
of relational languages on this structure, but this apparently simple
approach has a number of complications:

• Our labels are drawn from a heterogeneous collection of types, so it may
  be appropriate to use more than one relation.

• If information also is held at nodes, one needs additional relations to
  express this.

• The node identifiers may only be used as temporary node labels, and one
  may want to limit the way they can appear in the output of the query. How
  they are used is related to the discussion of object identity.

• We are concerned with what is accessible from a given “root” by forward
  traversal of the edges, and one may want to limit the languages
  appropriately.

Some forms of unbounded search will require recursive queries, i.e., a
“graph datalog”, and such languages are proposed in [26, 16] for the web and
for hypertext. Theoretical treatments of queries that deal with computation
on graphs or on the web appear in [6, 30]. It should also be mentioned that
this model of computation is used in [5, 15] as a starting point for
optimization.

The second strategy is adopted in the basis for UnQL [11, 10]. Here the
starting point is that of structural recursion, and is an extension of a
principle put forward in [12] that there are natural forms of computation
associated with the type. For semistructured data one starts with the
natural form of recursion associated with the recursive datatype of labeled
trees. However, some restrictions need to be placed for such recursive
programs to be well-defined: we want them to be well-defined on graphs with
cycles. These restrictions give rise to an algebra that can be viewed as
having two components: a “horizontal” component that expresses computations
across the edges of a given node (and from this, computations to a fixed
depth from the root); and a “vertical” component that expresses computations
that go to arbitrary depths in the graph. A property of this algebra is
that, when restricted to input and output data that conform to a relational
(nested relational) schema, it expresses exactly the relational (nested
relational) algebra. Hence an SQL-like language is a natural fragment of
UnQL.

The SQL or OQL like languages we have mentioned typically bring information
to the surface, but they are not capable of performing complex or “deep”
restructuring of
[16] Mariano P. Consens and Alberto O. Mendelzon. Expressing structural
     hypertext queries in graphlog. In Proc. 2nd. ACM Conference on
     Hypertext, pages 269-292, Pittsburgh, November 1989.

[17] Susan B. Davidson, Christian Overton, Val Tannen, and Limsoon Wong.
     Biokleisli: A digital library for biomedical researchers. In Journal of
     Digital Libraries, volume 1:1, November 1996. See
     http://www.cis.upenn.edu/~db.

[18] Mary Fernandez, Daniela Florescu, Jaewoo Kang, Alon Levy, and Dan
     Suciu. STRUDEL: A Web Site Management System. In Proceedings of
     ACM-SIGMOD International Conference on Management of Data, Tucson, May
     1997.

[19] Mary Fernandez, Lucian Popa, and Dan Suciu. A structure based approach
     to querying semi-structured data, 1997. Manuscript available from
     http://www.research.att.com/info/{mff,suciu}.

[20] Mary Fernandez and Dan Suciu. Query optimizations for semi-structured
     data using graph schemes, 1996. Manuscript available from
     http://www.research.att.com/info/{mff,suciu}.

[21] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman,
     Y. Sagiv, J. Ullman, and J. Widom. The Tsimmis approach to mediation:
     Data models and languages. In Proceedings of Second International
     Workshop on Next Generation Information Technologies and Systems, pages
     185-193, June 1995.

[22] Roy Goldman and Jennifer Widom. Dataguides: Enabling query formulation
     and optimization in semistructured databases. Technical report,
     Stanford, 1997.

[30] Alberto O. Mendelzon and Tova Milo. Formal models of web queries. In
     Proc. PODS '97, May 1997. In press, available in
     ftp.db.toronto.edu/pub/papers/pods97MM.ps.

[31] S. Nestorov, J. Ullman, J. Wiener, and S. Chawathe. Representative
     objects: Concise representations of semistructured hierarchical data.
     In Proceedings of the Thirteenth International Conference on Data
     Engineering, Birmingham, England, April 1997.

[32] Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object fusion
     in mediator systems. In Proc. 22nd VLDB conference, September 1996.

[33] Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer Widom.
     Object exchange across heterogeneous information sources. In
     Proceedings of IEEE International Conference on Data Engineering, pages
     251-260, March 1995.

[34] D. Quass, A. Rajaraman, Y. Sagiv, and J. Ullman. Querying
     semistructured heterogeneous information. In Proceedings of the Fourth
     International Conference on Deductive and Object-oriented Databases,
     pages 319-344, December 1995.

[35] Dan Suciu. Query decomposition for unstructured query languages. In
     Proc. 22nd VLDB conference, September 1996.

[36] Jean Thierry-Mieg and Richard Durbin. ACeDB 2. A C. elegans Database:
     Syntactic definitions for the ACeDB data base manager, 1992.