OLAP693
3 authors:
Hiroyuki Kitagawa
University of Tsukuba
All content following this page was uploaded by Chantola Kit on 03 October 2014.
2.1. Online Analytical Processing (OLAP)

Online Analytical Processing (OLAP) is a category of ... as understood by users.

When considering OLAP, the star schema, the cube, and aggregation operations are the most important concepts. To represent the multidimensional data model, a star schema, which consists of a single fact table and some dimension tables, is used. Each dimension table contains columns correspond...

[Figure 1: panels "Approach" and "System Implementation". Labels: XML, Fact, Dimension, Data-cube, XQuery w/OLAP Extension, Query Translation, SQL w/OLAP Extension.]

Table 1. Path table (left) and node table (right).

Path table

pid  pexp                  poccu
1    /bookinfo             1
2    /bookinfo/c           2
3    /bookinfo/c/@name     4
3    /bookinfo/c/c         3
4    /bookinfo/c/c/@name   6
5    /bookinfo/c/c/b       6
6    /bookinfo/c/c/b/t     12
7    /bookinfo/c/c/b/p     12
8    /sales                1

Node table

did  pid  nid  nnum     tname     value
0    1    0    1        bookinfo  null
0    2    1    1.1      c         null
0    3    3    1.1.1.1  CDATA     math
0    4    4    1.1.2    c         null
0    ...
1    9    46   1        sales     null
1    10   47   1.1      area      null
1    11   48   1.1.1    kanto     null
1    12   49   1.1.1.1  tsukuba   null
4.2. Formal Definitions

To construct an XML data-cube, we first need to specify the fact and the dimensions. Let us look at their definitions.

Facts about an XML Data. A fact table in a traditional OLAP system stores the data items being analyzed. We define the facts in XML data in the same way. To identify the facts, we use XPath as the query language. For example, when a user wants to get information about book sales from the sales XML data shown in the upper left of Figure 2, the related data items can be obtained by the fact path p_f = doc("sales.xml")//b.

Definition 1 (Fact path) A fact path (p_f) is an absolute XPath expression that identifies data items of interest.

Dimensions. Having fixed the fact data, we additionally need some dimensions whose values are used to group the facts together for the subsequent aggregation operations. In traditional OLAP systems, dimensions are given as independent tables associated with the fact table. In this work we define a dimension as an XPath query, but we need to take care of the relationship between the fact data and the dimensions. To ensure this relationship, a dimension path takes one of two forms: a relative path from the fact path, or an absolute path with referential constraints.

Definition 2 (Dimension path) A dimension path is an XPath expression (p_d) in either of the two forms:

1. p_d is a relative path expression originating from the fact path p_f, or

2. p_d is an absolute path expression that contains at least one condition with the fact path p_f.

Figure 2 shows an example of fact and dimension paths. The circles on the top left document represent the facts corresponding to p_f. When we want to use the book title as a dimension for the subsequent analysis, a dimension path can be given as p_d1 = t, which is a relative path from p_f. If we are interested in grouping the books according to price ranges represented in another XML data (the upper right document of Figure 2), we need to specify an absolute path expression with referential constraints, such as p_d2 = doc("bookinfo.xml")//b[t = p_f/t]/p. As can be seen from the example, for a given book, we can obtain the corresponding price in another XML data by using the title as the clue.

Concept Hierarchy. The concept hierarchy is a notable feature of traditional OLAP systems, by which we can carry out flexible grouping operations over the data items stored in the fact table. As with traditional OLAP systems, we assume that value-based concept hierarchies are given beforehand. We do not go into the detail of how to represent such a hierarchy, due to the page limitation. When dealing with XML data in the same context, we need to give special consideration to its semistructured nature. Specifically, we have to take into account the structure-based concept hierarchy, which is naturally represented as the hierarchical structure of XML data.

Taking Figure 2 as an example, all books (b) are categorized by the XML hierarchies according to the area or the book category. The structure-based concept hierarchy allows us to aggregate facts using such XML data structure. We will discuss the detail later.

Data Cube on XML Data. We are now ready to define a data cube on XML data using the concepts of the fact and dimension paths. Before going into the definition, we introduce some notation. For a given XPath expression p, [[p]] denotes an evaluation of p, whose result is a set of XML nodes, string-values, or a boolean. A subscript, as in [[p]]_f, denotes an evaluation of p relative to the node f.

Definition 3 (XML data-cube) An XML-cube is defined as (p_f, D) where p_f is a fact path and D = {p_d1, p_d2, ..., p_dn} is a set of dimension paths. A fact f in the cube is an (n+1)-tuple (f, d_1, ..., d_n) where f ∈ [[p_f]] and each d_i is obtained by evaluating p_di: [[p_di]]_f if p_di is in a relative form, or [[p'_di]] where p'_di is obtained by replacing each occurrence of p_f/p_r in p_di with [[p_r]]_f. n is the rank of the XML-cube.

Let us consider an XML data-cube as an example (Figure 2). It is defined as (p_f, {p_d}), where p_f = doc("sales.xml")//b and p_d = doc("bookinfo.xml")//b[t = p_f/t]/p. A tuple can be extracted as follows. First, the fact data are extracted by evaluating the fact path: [[p_f]] = {b1, b2, ..., b6}. For each fact b_i, we can identify the corresponding dimension data in another XML data as specified by p_d. When evaluating p_d, we need to rewrite the path according to the fact data. For example, for the fact b1, p_f/t, which is a part of p_d, is rewritten as [[p_f/t]]_b1 = {"A"}, so that p_d becomes doc("bookinfo.xml")//b[t = "A"]/p. In this way, we can extract all tuples from the data cube, which form a set of 2-tuples: {(b1, p1), (b2, p4), (b3, p3), (b4, p3), (b5, p2), (b6, p5)}.

In contrast to existing OLAP, an XML-cube may contain much more information than its dimensionality (what we call “rank”). That is, each XML fragment may contain more information than a numerical value, such as elements, texts, attributes, and hierarchical information. In order to form a cube-like structure, we need to specify some of them as dimensions of the cube structure.

For instance, each tuple of the rank 1 XML data-cube in Figure 2 (lower side) contains two XML fragments: books coming from “sales.xml” and prices from “bookinfo.xml”. According to the fragments, this XML data-cube potentially has five attribute values: title, quantity, area, price, and category. Assuming that we are interested in the information related to the book sales area and price, we can create a cube by specifying the area and price as the dimensions.

[Figure 2. Facts, dimensions, and Sales XML data-cube. Upper left: sales.xml; upper right: bookinfo.xml; lower: xmlcube tuples (b1, p1), (b2, p4), (b6, p5), shown with t/q/p values A 10 1000, D 20 2000, F 30 3400.]

4.3. OLAP Extensions to XQuery

Once the data cube is constructed, we perform multidimensional analysis using the dimensions and related information such as XML hierarchies. In our system, we use XQuery as the user query language. However, the current version of XQuery does not support aggregation functions. So we employ the syntax of the OLAP extension for XQuery [1], “GROUP BY ROLLUP” and “GROUP BY TOPOLOGICAL ROLLUP”. Like the roll-up operation in ordinary OLAP systems, “ROLLUP” enables a “SELECT” statement to calculate multiple levels of subtotals across a specified group of dimensions. It also calculates a grand total. “ROLLUP” is an extension to the “GROUP BY” clause, so its syntax is easy to use. The latter, “TOPOLOGICAL ROLLUP”, is similar to “ROLLUP” but computes structure-based grouping over XML data.

5. Implementation Using Relational Database Systems

This section discusses an implementation of the proposed model and grouping operations (Figure 1, right). We try to make the best use of relational databases as the underlying data storage. The reasons are: 1) there are many commercial and open source products, 2) an enormous amount of information resources is stored in relational systems, and 3) we can leverage established relational XML storage techniques. In addition, we can utilize the grouping functionalities supported in most relational database systems to implement value- and structure-based grouping of XML data.

5.1. Relational XML Storage

We employ the path-approach [8] for mapping XML data to relational tables, because we can manage any well-formed XML document with a fixed relational schema and realize a practical subset of XPath solely by the use of SQL functionalities. Due to the limitation of pages, we show just a brief overview. In the path-approach, an XML node is basically mapped to a relational tuple of two tables: the path table, containing the absolute path expressions of all XML nodes, and the node table, containing all XML node information. Table 1 (left) shows the path table extracted from “sales.xml” and “bookinfo.xml”. The node table (Table 1, right) has the columns did (document id), pid (path id), nid (node id), nnum (node number), tname (tag name), and value.

5.2. Extracting Fact and Dimensions

The first step is to extract the fact and the dimensions. As discussed in Section 4.2, a fact and its dimensions are XML sub-trees specified by XPath queries. Hence, we can represent the fact (or a dimension) as a part of the node table. This can be achieved by evaluating the fact (dimension) path and storing the result as a new table. Those tables can be defined as either views or materialized views.
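The storage and extraction steps of Sections 5.1 and 5.2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses Python's built-in sqlite3 and ElementTree, a hypothetical miniature of "sales.xml", and a simplified shredder that stores only element nodes (attributes and separate text nodes are left out). The table and column names follow Table 1; the view name `fact` and the helper `shred` are assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical miniature of "sales.xml"; the real document is not reproduced in the text.
SALES_XML = """
<sales><area><kanto><tsukuba>
  <b><t>A</t><q>10</q></b>
  <b><t>D</t><q>20</q></b>
</tsukuba></kanto></area></sales>
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE path (pid INTEGER PRIMARY KEY, pexp TEXT UNIQUE);
CREATE TABLE node (did INTEGER, pid INTEGER, nid INTEGER,
                   nnum TEXT, tname TEXT, value TEXT);
""")


def shred(did, xml_text):
    """Store every element of one document in the path and node tables."""
    root = ET.fromstring(xml_text)
    next_nid = conn.execute("SELECT COUNT(*) FROM node").fetchone()[0]

    def visit(elem, parent_path, nnum):
        nonlocal next_nid
        pexp = parent_path + "/" + elem.tag
        # One row per distinct absolute path expression.
        conn.execute("INSERT OR IGNORE INTO path(pexp) VALUES (?)", (pexp,))
        pid = conn.execute("SELECT pid FROM path WHERE pexp = ?",
                           (pexp,)).fetchone()[0]
        text = (elem.text or "").strip() or None
        conn.execute("INSERT INTO node VALUES (?, ?, ?, ?, ?, ?)",
                     (did, pid, next_nid, nnum, elem.tag, text))
        next_nid += 1
        # Children get Dewey-style node numbers: 1, 1.1, 1.1.1, ...
        for i, child in enumerate(elem, start=1):
            visit(child, pexp, f"{nnum}.{i}")

    visit(root, "", "1")


shred(1, SALES_XML)

# Fact path doc("sales.xml")//b, expressed over the path table:
# any node of document 1 whose absolute path ends in /b.
conn.execute("""
CREATE VIEW fact AS
SELECT n.did, n.nid, n.nnum, p.pexp
FROM node n JOIN path p ON n.pid = p.pid
WHERE n.did = 1 AND p.pexp LIKE '%/b'
""")

facts = conn.execute("SELECT nnum, pexp FROM fact ORDER BY nid").fetchall()
path_count = conn.execute("SELECT COUNT(*) FROM path").fetchone()[0]
print(facts)
```

Defining `fact` as a view keeps it in sync with the node table; a materialized table would trade freshness for faster repeated access, which matches the two options named in the text.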
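As a reference point for the relational construction, the tuple extraction of Definition 3 can also be evaluated directly on the XML. The sketch below is an illustration under stated assumptions: the two documents are hypothetical miniatures of those in Figure 2, the helper name `cube_tuples` is invented here, and Python's ElementTree supports only a small XPath subset, so the rewriting of p_d is done by string substitution.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniatures of the two documents in Figure 2 (values are illustrative).
sales = ET.fromstring(
    "<sales><area><kanto><tsukuba>"
    "<b><t>A</t><q>10</q></b><b><t>D</t><q>20</q></b>"
    "</tsukuba></kanto></area></sales>")
bookinfo = ET.fromstring(
    "<bookinfo><c name='math'><c name='linear algebra'>"
    "<b><t>A</t><p>1000</p></b><b><t>D</t><p>2000</p></b>"
    "</c></c></bookinfo>")


def cube_tuples(fact_doc, dim_doc):
    """Return (fact, dimension) pairs for p_f = //b and p_d = //b[t = p_f/t]/p."""
    tuples = []
    for fact in fact_doc.findall(".//b"):        # [[p_f]]
        title = fact.findtext("t")               # [[p_f/t]] for this fact
        # Rewrite p_d for this fact; assumes the title contains no quote characters.
        rewritten = f".//b[t='{title}']/p"
        for dim in dim_doc.findall(rewritten):
            tuples.append((fact, dim))
    return tuples


pairs = cube_tuples(sales, bookinfo)
summary = [(f.findtext("t"), f.findtext("q"), d.text) for f, d in pairs]
print(summary)
```

Each pair corresponds to one 2-tuple of the rank 1 cube; the fact element still carries its quantity and its position in the area hierarchy, which is the extra information the text refers to.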
5.3. Data Cube Construction

Next, we create an XML data-cube. For this purpose, we need to establish the relationships between the fact and the dimensions as described in Section 4.2. We join the base relations using the referential constraints as the join key. The XML data-cube table contains all attributes from the fact and the dimension, and each record consists of data from the fact and the dimension that have the same book title.

5.4. Query Processing

As discussed in Section 4.3, we use XQuery with OLAP extensions as the user query language. In order to process a query, we need to translate it into SQL, because we use relational database systems as the query processing engine. There have been several works on XQuery to SQL query translation [3], and we can borrow those ideas. So, in this paper, we focus on how to implement OLAP operations using SQL. Specifically, we discuss how to realize structure-based grouping and roll-up operations.

... the same 3-depth prefixes, “/sales/area/kanto” and “/sales/area/kansai”.

The proposed grouping operation can be implemented in many ways, but an important remark is that it can be realized solely by the functionality of SQL. One possible way is to leverage the string match functionality provided by the database system. More precisely, we can use regular expressions to extract substrings, and use them with the “GROUP BY” clause. Assume that we would like to use the first two tags to group the facts, e.g., use “/sales/area” out of “/sales/area/kanto/tsukuba/b”. We can achieve this by:

    SELECT ...
    FROM ...
    WHERE ...
    GROUP BY regexp_replace(dim.pexp,
        '^(/[^/]+/[^/]+)/.+', '\\1')

Another possibility is to introduce dedicated indexes based on Dewey encodings or prime numbers. They might speed up the grouping operations compared to the above approach. The comparison might be an interesting topic for research.
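The roll-up operation can likewise be expressed in plain SQL. The paper states that “UNION ALL” is used to perform “GROUP BY ROLLUP”; the sketch below shows one way this might look, assuming a hypothetical flattened cube table with columns `area`, `title`, and `qty` and made-up rows. Each branch of the union computes one level of the roll-up, with NULL marking the rolled-up columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cube (area TEXT, title TEXT, qty INTEGER);
INSERT INTO cube VALUES
  ('kanto', 'A', 10), ('kanto', 'D', 20),
  ('kansai', 'F', 30), ('kansai', 'F', 5);
""")

# Emulates GROUP BY ROLLUP(area, title): detail rows, per-area subtotals, grand total.
ROLLUP_SQL = """
SELECT area, title, SUM(qty) AS qty FROM cube GROUP BY area, title
UNION ALL
SELECT area, NULL,  SUM(qty)        FROM cube GROUP BY area
UNION ALL
SELECT NULL, NULL,  SUM(qty)        FROM cube
"""

rows = conn.execute(ROLLUP_SQL).fetchall()
for row in rows:
    print(row)
```

A roll-up over n grouping columns needs n + 1 branches, so the translated query grows linearly with the number of dimensions being rolled up.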
[Figure 3. Query processing time. Vertical axis: Time (ms), ticks from 10.000 to 100,000,000.000 in powers of ten. Horizontal axis: File size (10MB, 100MB, 200MB, 300MB, 400MB, 500MB). Series: item, quantity, payment, payqty, rollup(payment), rollup(region), rollup(regionpay).]

6.2. Benchmark Queries

For the benchmark query, we give a fact path, p_f = doc("xmark.xml")//item, and two dimension paths, p_d1 = quantity and p_d2 = payment.

We ran three queries to show the performance of the roll-up functions, which calculate the total quantity of items grouped by value (payment), by structure (region), and by their combination (regionpay).

... concepts of fact path, dimension path, value- and structure-based concept hierarchy, and XML data-cube. We then discussed the OLAP extension to XQuery. For the implementation, we use the path approach for mapping XML data to relations, and we utilize “UNION ALL” to perform the “GROUP BY ROLLUP” operation for both structure- and value-based groupings. Our experiments with large collections of XML data show that the “GROUP BY ROLLUP” queries complete in less than 10 sec. for 500MB of XML data. The results show the effectiveness of our proposed technique.

For future research, we will try to improve the performance of data-cube construction. We also plan to investigate how to incorporate textual features, such as word vectors of XML data, into the analytical processing.

8. Acknowledgments

This research is partly supported by the Grant-in-Aid for Scientific Research (17700110) from Japan Society for the Promotion of Science (JSPS), Japan, and the Grant-in-Aid for Scientific Research on Priority Areas (18049005) from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan.

References