
Semistructured Data *

Peter Buneman
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104-6389
[email protected]

* This work was partly supported by the Army Research Office (DAAH0495-1-0169) and the National Science Foundation (CCR92-16122).

Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee.
PODS '97 Tucson Arizona USA
Copyright 1997 ACM 0-89791-910-6/97/05 ..$3.50

Abstract

In semistructured data, the information that is normally associated with a schema is contained within the data, which is sometimes called “self-describing”. In some forms of semistructured data there is no separate schema, in others it exists but only places loose constraints on the data. Semistructured data has recently emerged as an important topic of study for a variety of reasons. First, there are data sources such as the Web, which we would like to treat as databases but which cannot be constrained by a schema. Second, it may be desirable to have an extremely flexible format for data exchange between disparate databases. Third, even when dealing with structured data, it may be helpful to view it as semistructured for the purposes of browsing. This tutorial will cover a number of issues surrounding such data: finding a concise formulation, building a sufficiently expressive language for querying and transformation, and optimization problems.

1 The motivation

The topic of semistructured data (also called unstructured data) is relatively recent, and a tutorial on the topic may well be premature. It represents, if anything, the convergence of a number of lines of thinking about new ways to represent and query data that do not completely fit with conventional data models. The purpose of this tutorial is to describe this motivation and to suggest areas in which further research may be fruitful. For a similar exposition, the reader is referred to Serge Abiteboul's recent survey paper [1].

The slides for this tutorial will be made available from a section of the Penn database home page http://www.cis.upenn.edu/~db.

1.1 Some data really is unstructured

The most obvious motivation comes from the need to bring new forms of data into the ambit of conventional database technology. Some of these, such as documents with structured text [3, 2] and data formats [9, 17], while they may call for increasingly expressive query languages and new optimization techniques, only require mild extensions to the existing notion of data models such as ODMG [13]. However these extensions still require the prior imposition of structure on the data, and there are some forms of data for which this is genuinely difficult.

The most immediate example of data that cannot be constrained by a schema is the World-Wide-Web. As database researchers we would like to think of this as a database, but to what extent are database tools available for querying or maintaining the web? Most web queries exploit information retrieval techniques to retrieve individual pages from their contents, but there is little available that allows us to use the structure of the web in formulating queries, and since the web does not obviously conform to any standard data model, we need a method of describing its structure.

Another example, little known to the database community but responsible for piquing the author's interest in this topic, is the database management system ACeDB, which is popular with biologists [36]. Superficially it looks like an object-oriented database system, for it has a schema language that resembles that of an object-oriented DBMS; but this schema imposes only loose constraints on the data. Moreover the relationship between data and schema is not easily described in object-oriented terms, and there are structures that are naturally expressed in ACeDB, such as trees of arbitrary depth, that cannot be queried using conventional techniques.

1.2 Data Integration

A second motivation is that of data exchange and transformation, which is the starting point for the Tsimmis project [33, 21] at Stanford. The rationale here is that none of the existing data models is all-embracing, so that it is difficult to build software that will easily convert between two disparate models. The Object Exchange Model (OEM) offers a highly flexible data structure that may be used to capture most kinds of data and provides a substrate in which almost any other data structure may be represented. In effect, OEM is an internal data structure for exchange of data between DBMSs, but having such a structure invites the idea of querying data in OEM format directly.
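As a rough illustration of why one generic labeled structure can serve as an exchange substrate, the sketch below encodes a relational tuple and a nested record in the same shape and then processes both with one function. The `Obj` class, its fields, and the helper names are assumptions made for this example; they simplify the idea and are not the OEM definition given in [33].

```python
from dataclasses import dataclass
from typing import Union

Atomic = Union[int, float, str]


@dataclass(frozen=True)
class Obj:
    """Hypothetical, simplified exchange object (not the definition in [33]):
    a label plus either an atomic value or a tuple of sub-objects."""
    label: str
    value: Union[Atomic, tuple]


def from_row(table: str, row: dict) -> Obj:
    """Encode one relational tuple: one child object per column."""
    return Obj(table, tuple(Obj(col, val) for col, val in row.items()))


def from_nested(label: str, data) -> Obj:
    """Encode nested dicts, lists and atoms, e.g. a record from an object store."""
    if isinstance(data, dict):
        return Obj(label, tuple(from_nested(k, v) for k, v in data.items()))
    if isinstance(data, list):
        return Obj(label, tuple(from_nested("item", v) for v in data))
    return Obj(label, data)


def labels(obj: Obj) -> list:
    """List every label depth first; the same code serves both sources."""
    out = [obj.label]
    if isinstance(obj.value, tuple):
        for child in obj.value:
            out.extend(labels(child))
    return out


# Illustrative data only.
row = from_row("movie", {"title": "Casablanca", "year": 1942})
rec = from_nested("movie", {"title": "Casablanca", "cast": ["Bogart", "Bergman"]})
print(labels(row))
print(labels(rec))
```

Because both sources end up in the same shape, a tool such as `labels` needs no knowledge of whether the data began as a relation or as a nested object, which is the property that makes direct querying of the exchange format attractive.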

Figure 1: An example movie database.
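A database of this kind can be written down as nested (label, subtree) pairs, and generic questions such as where a given string occurs can then be answered by a traversal that needs no schema. The sketch below is a minimal illustration under stated assumptions: the entries in `DB` are invented for the example and are not a transcription of the figure, the `Sym` class and all function names are hypothetical, a list stands in for an unordered set of edges, and the data is assumed to be acyclic.

```python
class Sym(str):
    """A symbol label (attribute or class name), kept distinct from string data."""


def leaf(value):
    """A data value is an edge labeled with the value, leading to an empty tree."""
    return (value, [])


# Illustrative data only; note the two different ways a cast is represented.
DB = [
    (Sym("Entry"), [
        (Sym("Movie"), [
            (Sym("Title"), [leaf("Casablanca")]),
            (Sym("Cast"), [leaf("Bogart"), leaf("Bergman")]),
            (Sym("Year"), [leaf(1942)]),
        ]),
    ]),
    (Sym("Entry"), [
        (Sym("Movie"), [
            (Sym("Title"), [leaf("Play it again, Sam")]),
            (Sym("Cast"), [(Sym("Actors"), [leaf("Allen")])]),
            (Sym("Year"), [leaf(1972)]),
        ]),
    ]),
]


def edges(tree, path=()):
    """Yield (path, label) for every edge; path holds the labels above the edge."""
    for label, subtree in tree:
        yield path, label
        yield from edges(subtree, path + (label,))


def where_is_string(tree, text):
    """Paths under which the string value `text` occurs."""
    return [path for path, label in edges(tree)
            if isinstance(label, str) and not isinstance(label, Sym)
            and label == text]


def integers_greater_than(tree, bound):
    """All integer values in the tree that exceed `bound`."""
    return [label for _, label in edges(tree)
            if isinstance(label, int) and label > bound]


def attributes_starting_with(tree, prefix):
    """(path, symbol) pairs whose symbol starts with `prefix`, ignoring case."""
    return [(path, str(label)) for path, label in edges(tree)
            if isinstance(label, Sym)
            and label.lower().startswith(prefix.lower())]


print(where_is_string(DB, "Casablanca"))
print(integers_greater_than(DB, 2 ** 16))
print(attributes_starting_with(DB, "act"))
```

Each function inspects the type of a label at run time instead of consulting a schema, which is why the same code works on both representations of a cast.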

1.3 Browsing

A final motivation is that of browsing. Generally speaking, a user cannot write a database query without knowledge of the schema. However, schemas may have opaque terminology and the rationale for the design is often difficult to understand. It may help in understanding the schema to be able to query data without full knowledge of the schema. For example the queries,

• Where in the database is the string “Casablanca” to be found?

• Are there integers in the database greater than 2^16?

• What objects in the database have an attribute name that starts with “act”?

Such questions cannot be answered in any generic fashion by standard relational or object-oriented query languages. While languages have been proposed that allow schema and data to be queried simultaneously [24] in the context of relational and object-oriented database systems, these languages do not have the flexibility to express complex constraints on paths, and it is not clear how their implementation will work on the structures described below.

2 The Model

The unifying idea in semi-structured data is the representation of data as some kind of graph-like or tree-like structure. Although we shall allow cycles in the data, we shall generally refer to these graphs as trees. The example in figure 1 is taken from [10] in which the data model is formalized as an edge labeled graph. The structure is taken (with some inaccuracies) from a well-known web database [23] that provides a good example of semistructured data. There are several things to note about it. If one confines one's attention to the parts of the database below Movie edges, the data appears fairly regular except that there are two ways of representing a cast. That is, the data does not quite fit with some relational or object-oriented presentation. Edges are labeled both with data, of types such as int and string and possibly other base or external abstract types (video, audio etc.). Edges are also labeled with names such as Movie and Title that would normally be used for attribute or class names. We shall refer to such labels as symbols. Internally they are represented as strings. Note that arrays may be represented by labeling internal edges with integers. We can formulate the type of this kind of labeled tree as:

    type label = int | string | ... | symbol
    type tree = set(label × tree)

The first line describes a tagged union or variant, the second says that a tree is a set of label/tree pairs. The edges out of nodes in our trees are assumed to be unordered.

There are a number of variations on this basic model, and it is worth briefly reviewing them. In [5] leaf nodes are labeled with data, internal nodes are not labeled with meaningful data, and edges are labeled only with symbols:

    type base = int | string | ...
    type tree = base | set(symbol × tree)

The differences between the two models are minor and give rise to minor differences in the query language. It is easy to define mappings in both directions.

Another possibility is to allow labels on internal nodes, for example:

    type label = int | string | ... | symbol
    type tree = label × set(label × tree)

The problem with using this representation directly is that it makes the operation of taking the union of two trees difficult to define. However, by introducing extra edges, this representation can be converted into one of the edge-labelled representations above.

A final and more complex issue is that of object identity, by which we mean node labels - or possibly edge labels - that, apart from an equality test, are not observable in the query language. In OEM, object identities are used as node labels and place-holders to define trees. While object-identities provide an efficient way to define and test equality
within a database, they pose problems when comparing data across databases. See [10, 25, 32] for discussions and related work.

It is straightforward to encode relational and object-oriented databases in this model, although in the latter case one must take care to deal with the issue of object-identity. However, the coding is not unique, and the examples in [10] and [5] show some differences in how tuples of sets are treated.

The term “self describing” is often used to describe unstructured data. In each of the models we have described, the data is a tagged union type, and one can imagine a program whose behavior is dynamically determined by “switching” on the type. For example, a program's behavior may be altered by whether it finds an integer or string as a label, and one would expect any language for dealing with semistructured data to incorporate predicates that describe the type of an edge or node. The situation is similar to that in programming languages. Lisp and many interpreted and scripting languages are dynamically typed. Predicates are available to determine (at run time) the type of a value or class of an object. Languages in the Algol tradition (Pascal, C, ML, Modula) are statically typed. Predicates are not needed to determine the type of a value because it is known from the source code of the program and hence to the programmer. There is a good analogy between dynamic type systems and semistructured data on one hand, and static type systems and databases with schemas on the other.

3 Query Languages

There appear to be two general approaches to devising query languages for semistructured data. First, take SQL (or perhaps OQL [14, 13]) as a starting point and add enough “features” to perform a useful class of queries. The second approach is to start from a language based on some formal notion of computation on semistructured data then to massage that language into acceptable syntax. It is remarkable that the two approaches appear to end up with very similar languages.

Let us start with the first approach to see what kinds of queries are useful. The following SQL-like syntax suggests itself:

    select Entry.Movie.Title
    from DB
    where Entry.Movie.Director ...

However the syntax does not make clear how much of the two paths Entry.Movie.Title and Entry.Movie.Director are to be taken as the same. The solution is to introduce variables to indicate how paths or edges are to be tied together. These variables can then be used in other expressions to form new structures. Label variables, tree variables and possibly path variables are needed to express a reasonable set of queries.

The next problem is that one wants to specify paths of arbitrary length to find, for example, all the strings in the database. This requires us to be able to express arbitrary paths in our syntax. Even this is not enough. Consider the problem of finding whether “Allen” acted in “Casablanca”. One might try this by searching for paths from a Movie edge down to an “Allen” edge, but one would not want this path to contain another Movie edge. These problems indicate that one would like to have something like regular expressions to constrain paths.

The “select” fragment of UnQL [10] and the Lorel query language [5] solve these problems with very similar syntactic forms. Lorel, which is a component of the Lore project [27], requires a rich set of overloadings for its operators for dealing with comparisons of objects with values and of values with sets. These are avoided in UnQL by not having object identity and exploiting a simple form of pattern matching. Other languages that use a SQL-like syntax include a precursor to Lorel [34], and WebSQL [29, 7] which contains a number of constructs specific to web queries. A language for web site management is proposed in [18].

Having asked what the surface syntax should look like, one wants to ask what the underlying computational strategy should be. Here there appear to be two principled strategies. The first is to model the graph as a relational database and then exploit a relational query language. In our labeled graph model this is remarkably simple. We can take the database as a large relation of type (node-id, label, node-id) and consider the expressive power of relational languages on this structure, but this apparently simple approach has a number of complications:

• Our labels are drawn from a heterogeneous collection of types, so it may be appropriate to use more than one relation.

• If information also is held at nodes, one needs additional relations to express this.

• The node identifiers may only be used as temporary node labels, and one may want to limit the way they can appear in the output of the query. How they are used is related to the discussion of object identity.

• We are concerned with what is accessible from a given “root” by forward traversal of the edges, and one may want to limit the languages appropriately.

Some forms of unbounded search will require recursive queries, i.e., a “graph datalog”, and such languages are proposed in [26, 16] for the web and for hypertext. Theoretical treatments of queries that deal with computation on graphs or on the web appear in [6, 30]. It should also be mentioned that this model of computation is used in [5, 15] as a starting point for optimization.

The second strategy is adopted in the basis for UnQL [11, 10]. Here the starting point is that of structural recursion, and is an extension of a principle put forward in [12] that there are natural forms of computation associated with the type. For semistructured data one starts with the natural form of recursion associated with the recursive datatype of labeled trees. However, some restrictions need to be placed for such recursive programs to be well-defined: we want them to be well-defined on graphs with cycles. These restrictions give rise to an algebra that can be viewed as having two components: a “horizontal” component that expresses computations across the edges of a given node (and from this, computations to a fixed depth from the root); and a “vertical” component that expresses computations that go to arbitrary depths in the graph. A property of this algebra is that, when restricted to input and output data that conform to a relational (nested relational) schema, it expresses exactly the relational (nested relational) algebra. Hence an SQL-like language is a natural fragment of UnQL.

The SQL or OQL like languages we have mentioned typically bring information to the surface, but they are not capable of performing complex or “deep” restructuring of
the data. Simple examples of such operations include deleting/collapsing edges with a certain property, relabeling edges, or performing local interchanges. Both “graph datalog” and UnQL are capable of various forms of restructuring. For example, in UnQL one can write a query that corrects the egregious error in the “Bacall” edge label. One can also perform a number of global restructuring functions such as deleting edges with certain properties or adding new edges to “short-circuit” various paths. The relationship between the restructuring possible in UnQL and what can be done in “graph datalog” is not understood. Some simple forms of restructuring are also present in a view definition language proposed in [4].

4 Implementation and Optimizations

This topic is very much in its infancy and again depends on the underlying representation of the data. Moreover the optimization problems differ depending on whether one is using a semistructured model as an interface to existing data or one is building a data structure to represent semistructured data directly [28]. In the former case the extensions of existing techniques for optimization of object-oriented or relational query languages mentioned above may be exploited together with the addition of path or text indices on labels and strings. In the second case, disk layout and clustering, together with appropriate indexing, is also important.

In [10] a large class of computations can be shown to be translatable into a basic graph transformation technique which, in turn, allows some simple optimizations. Also some of the basic optimizations of the relational algebra can be applied to the “vertical” computations. In [35] it is shown how an analysis of the query, combined with some segmentation of the graph into local “sites” can be used to decompose a query into independent, parallel sub-queries. In [5] and [15] extensions to optimization techniques for object-oriented query languages are exploited. In [19] a translation is specified for a fragment of UnQL into an underlying relational structure.

5 Adding Structure

One of the main attractions of semistructured data is that it is unconstrained. Nevertheless, it may be appropriate to impose (or to discover) some form of structure in the data. In [8] a schema is defined as a graph whose edges are labeled with predicates and the property of simulation is used to describe the relationship between data and schema. In [31, 22] the schema is also an edge labeled graph and the stronger relationship of automata equivalence is used. In [20] schemas are used for further optimization.

Schemas are useful for browsing and for providing partial answers to queries. They will also be needed for the passage back from semistructured to structured data, for which a richer notion of schema is necessary. This is an area in which much further work is needed.

6 Acknowledgments

I would like to thank Susan Davidson and Dan Suciu for their collaboration and for stimulating my interest in this area. I am greatly indebted to Serge Abiteboul for most constructive discussions on a number of issues.

References

[1] Serge Abiteboul. Querying semi-structured data. In Proceedings of ICDT, Jan 1997.
[2] Serge Abiteboul, Sophie Cluet, Vassilis Christophides, Tova Milo, and Jérôme Siméon. Querying documents in object databases. In Journal of Digital Libraries, volume 1:1, 1997.
[3] Serge Abiteboul, Sophie Cluet, and Tova Milo. Querying and updating the file. In Proceedings of 19th International Conference on Very Large Databases, pages 73-84, Dublin, Ireland, 1993.
[4] Serge Abiteboul, Roy Goldman, Jason McHugh, Vasilis Vassalos, and Yue Zhuge. Views for semistructured data. Technical report, Stanford University, 1997.
[5] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L. Wiener. The Lorel query language for semistructured data. In Journal of Digital Libraries, volume 1:1, 1997. To appear. See http://www-db.stanford.edu/pub/papers/.
[6] Serge Abiteboul and Victor Vianu. Queries and computation on the web. In Proceedings of ICDT, Jan 1997.
[7] Gustavo O. Arocena, Alberto O. Mendelzon, and George A. Mihaila. Applications of a Web query language. In Proc. 6th. Int'l. WWW Conf., April 1997. In press.
[8] P. Buneman, S. Davidson, Mary Fernandez, and D. Suciu. Adding structure to unstructured data. In Proceedings of ICDT, January 1997.
[9] P. Buneman, S.B. Davidson, K. Hart, C. Overton, and L. Wong. A data transformation system for biological data sources. In Proceedings of VLDB, Sept 1995.
[10] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. A query language and optimization techniques for unstructured data. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 505-516, Montreal, Canada, June 1996.
[11] Peter Buneman, Susan Davidson, and Dan Suciu. Programming constructs for unstructured data. In Proceedings of 5th International Workshop on Database Programming Languages, Gubbio, Italy, September 1995. To appear.
[12] Peter Buneman, Shamim Naqvi, Val Tannen, and Limsoon Wong. Principles of programming with complex objects and collection types. Theoretical Computer Science, 149(1):3-48, September 1995.
[13] R. G. G. Cattell, editor. The Object Database Standard: ODMG-93. Morgan Kaufmann, San Mateo, California, 1996.
[14] Sophie Cluet and Claude Delobel. A general framework for the optimization of object oriented queries. In M. Stonebraker, editor, Proceedings ACM-SIGMOD International Conference on Management of Data, pages 383-392, San Diego, California, June 1992.
[15] Sophie Cluet and Guido Moerkotte. Query processing in the schemaless and semistructured context. Technical report, INRIA, 1997.
[16] Mariano P. Consens and Alberto O. Mendelzon. Expressing structural hypertext queries in graphlog. In Proc. 2nd. ACM Conference on Hypertext, pages 269-292, Pittsburgh, November 1989.
[17] Susan B. Davidson, Christian Overton, Val Tannen, and Limsoon Wong. Biokleisli: A digital library for biomedical researchers. In Journal of Digital Libraries, volume 1:1, November 1996. See http://www.cis.upenn.edu/~db.
[18] Mary Fernandez, Daniela Florescu, Jaewoo Kang, Alon Levy, and Dan Suciu. STRUDEL: A Web Site Management System. In Proceedings of ACM-SIGMOD International Conference on Management of Data, Tucson, May 1997.
[19] Mary Fernandez, Lucian Popa, and Dan Suciu. A structure based approach to querying semi-structured data, 1997. Manuscript available from http://www.research.att.com/info/{mff,suciu}.
[20] Mary Fernandez and Dan Suciu. Query optimizations for semi-structured data using graph schemas, 1996. Manuscript available from http://www.research.att.com/info/{mff,suciu}.
[21] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. The Tsimmis approach to mediation: Data models and languages. In Proceedings of Second International Workshop on Next Generation Information Technologies and Systems, pages 185-193, June 1995.
[22] Roy Goldman and Jennifer Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. Technical report, Stanford, 1997.
[23] The internet movie database. http://us.imdb.com/.
[24] M. Kifer, W. Kim, and Y. Sagiv. Querying object-oriented databases. In M. Stonebraker, editor, Proceedings ACM-SIGMOD International Conference on Management of Data, pages 393-402, San Diego, California, June 1992.
[25] Anthony Kosky. Observational properties of databases with object identity. Technical Report MS-CIS-95-20, University of Pennsylvania, 1995.
[26] Laks V.S. Lakshmanan, Fereidoon Sadri, and Iyer N. Subramanian. A declarative language for querying and restructuring the world-wide-web. In Post-ICDE IEEE Workshop on Research Issues in Data Engineering (RIDE-NDS'96), New Orleans, February 1996. See also ftp://ftp.cs.concordia.ca/pub/laks/papers.
[27] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A database management system for semistructured data. Technical report, Stanford University Database Group, February 1997.
[28] J. McHugh and J. Widom. Integrating dynamically-fetched external information into a dbms for semistructured data. Technical report, Stanford University, 1997.
[29] Alberto O. Mendelzon, George A. Mihaila, and Tova Milo. Querying the World Wide Web. In Proc. PDIS '96, pages 80-91, December 1996. Available as ftp.db.toronto.edu/pub/papers/pdis96.ps.gz.
[30] Alberto O. Mendelzon and Tova Milo. Formal models of web queries. In Proc. PODS '97, May 1997. In press, available in ftp.db.toronto.edu/pub/papers/pods97MM.ps.
[31] S. Nestorov, J. Ullman, J. Wiener, and S. Chawathe. Representative objects: Concise representations of semistructured hierarchical data. In Proceedings of the Thirteenth International Conference on Data Engineering, Birmingham, England, April 1997.
[32] Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object fusion in mediator systems. In Proc. 22nd VLDB conference, September 1996.
[33] Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer Widom. Object exchange across heterogeneous information sources. In Proceedings of IEEE International Conference on Data Engineering, pages 251-260, March 1995.
[34] D. Quass, A. Rajaraman, Y. Sagiv, and J. Ullman. Querying semistructured heterogeneous information. In Proceedings of the Fourth International Conference on Deductive and Object-oriented Databases, pages 319-344, Dec 1995.
[35] Dan Suciu. Query decomposition for unstructured query languages. In Proc. 22nd VLDB conference, September 1996.
[36] Jean Thierry-Mieg and Richard Durbin. ACeDB - A C. elegans Database: Syntactic definitions for the ACeDB data base manager, 1992.