2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
Towards Parallel Spatial Query Processing for Big Spatial Data
Yunqin Zhong 1,2,* , Jizhong Han 1 , Tieying Zhang 1,2 , Zhenhua Li 3 , Jinyun Fang 1 , Guihai Chen 4
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 Graduate University of Chinese Academy of Sciences, Beijing, China
3 Peking University, Beijing, China
4 Shanghai Jiaotong University, Shanghai, China
* Corresponding author, e-mail: [email protected]
Abstract—In recent years, spatial applications have become more and more important in both scientific research and industry. Spatial query processing is the fundamental component supporting spatial applications. However, the state-of-the-art techniques of spatial query processing are facing significant challenges as the data expand and user accesses increase. In this paper we propose and implement a novel scheme (named VegaGiStore) to provide efficient spatial query processing over big spatial data and numerous concurrent user queries. Firstly, a geography-aware approach is proposed to organize spatial data in terms of geographic proximity, and this approach can achieve high aggregate I/O throughput. Secondly, in order to improve data retrieval efficiency, we design a two-tier distributed spatial index for efficient pruning of the search space. Thirdly, we propose an "indexing + MapReduce" data processing architecture to improve the computation capability of spatial queries. Performance evaluations of the real-deployed VegaGiStore system confirm its effectiveness.

Keywords-spatial data management; distributed storage; spatial index; spatial query; spatial applications;

I. INTRODUCTION

In recent years, spatial applications such as Web-based Geographical Information Systems (WebGIS) and Location-Based Social Networking Services (LBSNS) have become more and more important in both scientific research and industry. Spatial query processing is the fundamental component supporting spatial applications. However, the state-of-the-art techniques of spatial query processing are facing significant challenges as the data expand and user accesses increase [1]. With the development of earth observation technologies, spatial data are growing exponentially year by year (currently at a petabyte scale), and their categories are becoming more diverse, including multi-dimensional geographic data, multi-spectrum remote sensing imagery, high-resolution aerial photographs, and so on. Besides, as spatial applications become more popular, concurrent user accesses to them are becoming highly intensive.

Spatial data objects are generally nested and more complex than basic data types (e.g., strings). They are stored as multi-dimensional geometry objects, e.g., points, lines and polygons. Moreover, spatial query predicates are complex: typical spatial queries are based not only on the values of alphanumeric attributes but also on the spatial location, extent and measurements of spatial objects in a reference system. Therefore, spatial query processing over big spatial data requires intensive disk I/O accesses and spatial computation.

The state-of-the-art techniques of spatial query processing mainly include SDB (spatial databases) [2] and KVS (key-value stores). SDB provides a spatial query language (i.e., spatial SQL) [3] and performs well when handling relatively small spatial datasets of megabytes or gigabytes [4]. However, since spatial queries are usually both I/O intensive and computing intensive, e.g., a single query may take minutes or even hours in SDB [5], the I/O and computation capabilities of SDB can hardly meet the high performance requirements of spatial queries over big spatial data. The emerging KVS systems, such as Bigtable [6], HBase [7] and Cassandra [8], have proved to be feasible alternatives for storing big semi-structured data thanks to their scalability. They have been adopted in some I/O intensive applications, e.g., Bigtable has been used to store satellite imagery for Google Earth [6]. However, the data in key-value stores are organized regardless of geographic proximity, and they are indexed by key-based structures (e.g., B+ trees) rather than spatial indexes. Therefore, KVS cannot process spatial queries efficiently.

Driven by the above problems, in this paper we propose and implement a novel scheme (named VegaGiStore) to provide efficient spatial query processing over big spatial data and numerous concurrent user queries. Firstly and most importantly, we propose a geography-aware data organization approach to achieve high aggregate I/O throughput. The big spatial data are partitioned into blocks according to the geographic space and a block size threshold (the threshold is the maximum size of a block; the partitioning process does not finish until the total size of the spatial objects within each partitioned region is smaller than the threshold), and these blocks are uniformly distributed over the cluster nodes. Then the geographically adjacent spatial objects are stored sequentially in terms of a space filling curve, which preserves the geographic proximity of spatial objects. In practical spatial applications, most clients only focus on a relatively small area and query for adjacent spatial objects within that area. Thereby concurrent clients can be served in parallel by
different cluster nodes, and adjacent spatial objects can be streamed to clients sequentially without random I/O seeks.

Secondly, in order to improve data retrieval efficiency, we design a two-tier distributed spatial index for efficient pruning of the search space. The index consists of a Quadtree-based [9] global index and a Hilbert-ordering local index, where the global index is used to find data blocks and the local index is used to locate spatial objects.

Thirdly, we propose an "indexing + MapReduce" data processing architecture to improve the spatial query computation capability. This architecture takes advantage of data-parallel processing techniques to provide both intra-query parallelism and inter-query parallelism, and thus can reduce individual spatial query execution time and support a large number of concurrent spatial queries.

We have implemented VegaGiStore on top of Hadoop [10], an emerging open-sourced cloud platform. VegaGiStore can support numerous concurrent spatial queries for various spatial applications like Web Mapping Services (WMS), Web Feature Services (WFS) and Web Coverage Services (WCS) [1]. Compared with the traditional methods, VegaGiStore improves the average speed-up ratio by 70.98%-75.89% when processing spatial queries on a 17-node cluster, and its average spatial query performance is about 10.3-13.5 times better than that of single-node spatial databases. Moreover, its average I/O throughput is 99%-235% higher than that of the compared key-value stores. In summary, our contributions in this paper can be summarized as follows:

1) We present a feasible scheme for efficient processing of spatial queries over big spatial data. We tackle the problem through three significant approaches: a geography-aware organization approach for high I/O throughput; a two-tier distributed spatial index for data retrieval efficiency; and an "indexing + MapReduce" spatial querying architecture for parallel processing. Our scheme can be easily integrated into a cloud computing platform (e.g., Hadoop [10]) to support parallel spatial query processing.
2) We have implemented a spatial data management system termed VegaGiStore on top of HDFS (Hadoop Distributed File System) [11] and the MapReduce framework [12]. VegaGiStore provides multifunctional spatial queries which most key-value store systems do not have, and it is transparent to spatial applications. Besides, the system evaluations show that VegaGiStore outperforms spatial databases and emerging key-value stores while processing concurrent spatial queries from numerous clients in practical spatial applications.

The rest of the paper is organized as follows. Section II details the parallel spatial query processing scheme. Section III presents the performance evaluation. Section IV reviews the related work. Finally, Section V concludes this paper.

II. PARALLEL SPATIAL QUERY PROCESSING SCHEME

A. Geography-aware Spatial Data Organization Approach

1) Spatial Data Partitioning: We propose a geography-aware quadripartition method to partition a large map layer. The scheme is designed to guarantee that data within a partitioned region are stored on one node, and that all spatial data are distributed across the cluster according to geographical space. Spatial data objects are logically or physically organized in multi-scale map layers. A spatial object has three attributes: ID (identifier), MBR (Minimum Bounding Rectangle) and object value. A map layer also has three attributes: unique name, MBR and resolution. The MBR is an expression of the maximum extent of a 2-dimensional spatial object; it is frequently used as an indication of the general position of a spatial object, and it serves as spatial metadata for first-approximation spatial queries and for spatial indexing. Therefore, spatial applications can access spatial data within different regions from different nodes to provide spatial information services for numerous users.

Input: Region (i.e., MBR of a map layer)
Output: Partitioned subregions
1: Initiate(region)
2: M_SIZE <- 64 MB
3: {0, 1, 2, 3} <- {NW, NE, SE, SW}
4: Boolean isValid <- Verify(region)
5: if isValid then
6:   for i = 0 to 3 do
7:     subregion[i] <- Partition(region)
8:   end for
9: else
10:   exit(0)
11: end if
12: for i = 0 to 3 do
13:   Verify_Partition(subregion[i])
14: end for
Figure 1. Verify_Partition(region): procedure for partitioning a region.

The partitioning process is described as follows, and the respective pseudocode is shown in Fig. 1 (a hedged code sketch of the recursive procedure follows the list).
1) Input and initialization. Input a map layer/region and compute the size of its objects, including the real data size and the additional index size.
2) Verification. If the region size is larger than the threshold M_SIZE, set the isValid flag to "TRUE" and go to Step 3; otherwise the region size is not larger than M_SIZE, so set isValid to "FALSE" to indicate that the region need not be partitioned.
3) Partition. Partition the region into four quadrants according to its MBR; each quadrant represents one subregion.
4) Compute the sizes of the four subregions respectively, and go to Step 2 to verify each subregion recursively and determine whether it should be further partitioned.
5) The partitioning process is executed recursively until no subregion is larger than M_SIZE.
6) If all partitioned regions satisfy the validity requirements, return "0"; otherwise terminate the partition procedure.
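For concreteness, the following is a minimal Java sketch of the recursive quadripartition of Fig. 1; the Region class, its size accounting and the quarter-size approximation are illustrative assumptions rather than VegaGiStore's actual implementation.

import java.util.ArrayList;
import java.util.List;

public class QuadPartitioner {
    static final long M_SIZE = 64L * 1024 * 1024;        // 64 MB block size threshold

    /** A square region described by its MBR and the total size (data + index) of its objects. */
    public static class Region {
        final double minX, minY, maxX, maxY;
        final long sizeInBytes;                           // assumed to be precomputed from the dataset
        Region(double minX, double minY, double maxX, double maxY, long sizeInBytes) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
            this.sizeInBytes = sizeInBytes;
        }
    }

    /** Recursively split a region into NW/NE/SE/SW quadrants until every
     *  subregion holds at most M_SIZE bytes of spatial objects (cf. Fig. 1). */
    public static void verifyPartition(Region region, List<Region> leaves) {
        if (region.sizeInBytes <= M_SIZE) {               // small enough: becomes one SOFile/HDFS block
            leaves.add(region);
            return;
        }
        for (Region quadrant : partition(region)) {
            verifyPartition(quadrant, leaves);
        }
    }

    /** Split a region into four quadrants; per-quadrant sizes are crudely
     *  approximated as a quarter each, only to keep the sketch self-contained. */
    static Region[] partition(Region r) {
        double midX = (r.minX + r.maxX) / 2, midY = (r.minY + r.maxY) / 2;
        long quarter = r.sizeInBytes / 4;
        return new Region[] {
            new Region(r.minX, midY, midX, r.maxY, quarter),   // NW
            new Region(midX, midY, r.maxX, r.maxY, quarter),   // NE
            new Region(midX, r.minY, r.maxX, midY, quarter),   // SE
            new Region(r.minX, r.minY, midX, midY, quarter)    // SW
        };
    }

    public static void main(String[] args) {
        List<Region> leaves = new ArrayList<>();
        verifyPartition(new Region(0, 0, 1, 1, 512L * 1024 * 1024), leaves);
        System.out.println(leaves.size() + " leaf regions");  // 16 for a uniform 512 MB layer
    }
}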
According to the principle of geographic proximity, spatial objects within a region are combined into one data block, so the threshold size M_SIZE should be set as large as the HDFS block size in order to guarantee that the spatial data within a region are stored on the same node; it is typically set to 64MB and can be varied according to the dataset volume and the cluster scale. Otherwise, the spatial data within a region may be stored on more than one node, which reduces data retrieval performance.

According to the partition procedure, three deductions are described as follows.
- Let κ denote the size of a square region; a region of size κ contains 2^κ × 2^κ spatial objects.
- The upper-left point is defined as the first object of the region.
- The first κ bits of the coordinates (x, y) of the first object are "0", i.e., x = x_n···x_κ 00···0 and y = y_n···y_κ 00···0, where n denotes the size of the parent region. The higher (n − κ) bits of the coordinates of the objects within the region are identical; they are defined as the region code, which is represented as (y_n x_n)(y_{n−1} x_{n−1})···(y_κ x_κ).

Fig. 2 shows an example of partitioning a region with the quadripartition scheme. The region size is κ = 4, its subregions are represented by solid-line squares, and it contains 2^4 × 2^4 = 256 spatial objects, represented by dotted squares.

Figure 2. Quadripartition.    Figure 3. Hilbert-order storage.

2) SOFile: We design a spatial object placement structure termed SOFile (Spatial Object File). An SOFile is created during the partitioning process, and the spatial objects within a subregion are stored in an SOFile named by the subregion's GC value. The raster data are stored as tile objects in the SOFile, whereas the vector data are stored as WKB (Well-Known Binary) objects. Taking geographic proximity into consideration, geographically adjacent objects should be stored in sequential disk pages. Spatial objects within a partitioned subregion are stored into the SOFile along a space filling curve, and they are organized in Hilbert order instead of row-wise order or Z order because the Hilbert curve has a better locality-preserving property [1].

Each SOFile consists of geographically adjacent spatial objects within a specific subregion, and one SOFile occupies one data block. Since there are two conventional spatial data models in spatial applications, we design two different SOFile structures, for raster tiles and for vector geometry objects, respectively. The structure of the SOFile for the raster data model, called SOFileRaster, is shown in Fig. 4; Fig. 5 shows the structure of the SOFile for the vector data model, termed SOFileVector.

Figure 4. Structure of SOFile for the raster data model. SOFileRaster is designed for raster data placement and contains a local index header and raster objects.

Figure 5. Structure of SOFile for the vector data model. SOFileVector is designed for vector geometry object placement and contains a local index header and WKB objects.

Both SOFileRaster and SOFileVector inherit from the SOFile structure, which includes a local index header and a real data part. Since raster data and vector data serve different functions in spatial queries, we design a different index structure for each of the two spatial data models; the local index header is the main distinction between SOFileRaster and SOFileVector, and it is described in Section II-B2. The local index header contains the metadata of the block and the index items of the spatial objects; the data content part contains the real data of the spatial objects within the region. The size of an SOFile is the total length of the index part and the real data part. The spatial objects are organized in Hilbert order and assigned unique HCs (Hilbert Codes), and adjacent spatial objects are stored on sequential disk pages, which guarantees geographic proximity and storage locality. Fig. 3 shows an example in which the spatial objects within region R31 are stored in Hilbert order.
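To illustrate the Hilbert ordering used inside an SOFile, the sketch below converts grid coordinates to a Hilbert Code with the standard bit-interleaving algorithm on the 2^κ × 2^κ grid of a region; this is the textbook formulation, not necessarily the authors' exact implementation.

public final class HilbertCode {
    /** Convert grid coordinates (x, y) on a 2^kappa x 2^kappa grid to a Hilbert Code (HC). */
    public static long xy2d(int kappa, long x, long y) {
        long n = 1L << kappa;
        long d = 0;
        for (long s = n / 2; s > 0; s /= 2) {
            long rx = (x & s) > 0 ? 1 : 0;
            long ry = (y & s) > 0 ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            if (ry == 0) {                    // rotate/flip the quadrant to keep the curve continuous
                if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
                long t = x; x = y; y = t;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // Sorting objects by xy2d(kappa, x, y) places geographically adjacent objects
        // on sequential disk pages, which is the storage order of an SOFile.
        System.out.println(xy2d(2, 1, 1));    // prints 2 on the 4 x 4 grid of a kappa = 2 region
    }
}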
The leaf nodes of the global index tree point to data block files with the suffix ".sof" (spatial object file) on HDFS, and a non-leaf node represents a region that has been partitioned into four smaller subregions because its size is larger than the threshold M_SIZE. Fig. 6 shows the hierarchical directory structure of a region (κ = 4) stored on HDFS, corresponding to the quadripartition example shown in Fig. 2. An ellipse represents a storage directory corresponding to a non-leaf node, and a rectangle represents a data block file corresponding to a leaf node. HDFS creates one block for each file, and file blocks are distributed across the cluster nodes for load balancing. As shown in Fig. 6, the directory hierarchy is a quadtree-like structure: the root node of the quadtree represents the root directory identified by the "Global Code", and its four children represent subdirectories and ".sof" files.

Figure 6. Hierarchical structure for spatial data on HDFS (κ = 4).

B. Two-tier Distributed Spatial Index

The VegaGiStore system must be able to retrieve, from a large collection of objects in some space, those lying within a particular area without scanning the whole dataset, so a spatial index is mandatory. In order to improve spatial data access performance and optimize spatial queries, we propose a scalable distributed spatial index to accelerate locating spatial objects on HDFS. Considering geographic proximity and storage locality, geographically adjacent data should be stored on the same node.

Our proposed distributed spatial index is a two-tier scalable index consisting of a global index and a local index, with two salient features. The global index is based on the revised distributed quadtree index [13] and is used to determine the data block location. The local index is built along a space filling curve and is used to locate spatial objects within a block. Moreover, the distributed index is designed and tuned for spatial applications, and it is oriented towards improving spatial data retrieval efficiency on HDFS.

1) Global Index: The global quadtree index is created during the quadripartition process. A large map layer is partitioned into four quadrants recursively until all subregions satisfy the threshold. Meanwhile, all spatial objects belonging to the map layer are partitioned according to their geographical space, and adjacent objects are sequentially stored into an SOFile. Once a large map layer is split into several subregions, the spatial data are partitioned into many data blocks and uniformly spread across the HDFS DataNodes.

The global index is quadtree-based, and the global tree structure is represented by the Global Code (GC). The GC is a quaternary code, where GC = c_1 c_2 ··· c_s = y_1 x_1 y_2 x_2 ··· y_s x_s. The GC value can be computed by (1), where s and κ denote the size of a region and of its subregions, (x, y) denotes the coordinates of objects, and c_i ∈ {0, 1, 2, 3}:

GC = Σ_{κ=1}^{s} (2y_κ + x_κ) × 4^{s−κ}    (1)

According to (1), each region has a unique GC value, which is used to construct the global index. As shown in Fig. 2, the quaternary numerals denote the GC values of regions, e.g., region R300 = 303 and R301 = 301, from which we can derive that the GC value of their parent node is 30.
Since a non-leaf node of the global index tree is pointed to only by its GC value, the global tree is very small, and the global index is resident in memory during the retrieval process. Besides, the <GC, MBR> pairs of the regions are maintained in a HashMap structure and are used to obtain MBR information for further spatial query computation.

2) Local Index: The local index is created when the subregion data are written into an SOFile, and the index data are stored in the SOFile as well. Therefore, the leaf nodes of the global quadtree point to the headers of spatial object files. The local index is used to index the spatial objects within an SOFile, and the local index header is organized as follows (a hedged lookup sketch follows the list).
- Metadata information. For the SOFileRaster structure, the first word is reserved for the data version; the second and third words are the (x, y) coordinates of the first object; the fourth word is the κ value of the region; the region is determined by its κ value and the coordinates (x, y) of the first tile object when processing raster data. For the SOFileVector structure, the first four words are the MBR (Minimum Bounding Rectangle) information of the region, represented by four double values; the fifth word is the HC (Hilbert Code) value of the first WKB object; the sixth and seventh words are the GC value and the κ value of the region, respectively.
- Index item. An index item contains two fields: offset and length. That is, the local index entry of each spatial object corresponds to an <offset, length> pair, and the index items of the spatial objects are written into the block sequentially.
- Index length. There are 2^κ × 2^κ objects and the index length of an object is 8 bytes, so the total length of the file indices is 2^{2κ+3} bytes. Thus the index length of SOFileRaster and SOFileVector is (2^{2κ+3} + 12) bytes and (2^{2κ+3} + 24) bytes, respectively.
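The lookup below is a hedged sketch of how one object can be fetched from an SOFileVector through its <offset, length> index item; the 24-byte metadata size follows the (2^{2κ+3} + 24)-byte formula above, while the 4-byte field widths and the file path are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SOFileReader {
    static final int VECTOR_META_BYTES = 24;   // metadata words of SOFileVector (assumed layout)
    static final int INDEX_ITEM_BYTES  = 8;    // one <offset, length> pair per object

    /** Read the i-th object of an SOFileVector, i being its position in Hilbert order. */
    public static byte[] readObject(FileSystem fs, Path sofile, int i) throws IOException {
        FSDataInputStream in = fs.open(sofile);
        try {
            in.seek(VECTOR_META_BYTES + (long) i * INDEX_ITEM_BYTES);
            int offset = in.readInt();         // where the WKB object starts in the data part
            int length = in.readInt();         // how many bytes it occupies
            byte[] wkb = new byte[length];
            in.readFully(offset, wkb);         // positioned read of the object itself
            return wkb;
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] obj = readObject(fs, new Path("/vegagistore/layer/30/300.sof"), 42);  // hypothetical path
        System.out.println(obj.length + " bytes");
    }
}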
C. "Indexing + MapReduce" Data Processing Architecture

We propose an "indexing + MapReduce" data processing architecture to improve the spatial query computation capability of VegaGiStore. This architecture takes advantage of data-parallel processing techniques to provide both intra-query parallelism and inter-query parallelism, and thereby can reduce individual spatial query execution time and serve a large number of concurrent spatial queries. Our scheme is specific to spatial queries, including spatial selection, spatial join and nearest neighbors, and the spatial queries are processed in multiple phases. The first, filter phase prunes non-qualified objects with the spatial index to obtain candidate intermediate sets, and the qualified candidate objects are then transferred as the input of the refinement phase. Finally, the spatial relation computation examines the actual object representations to determine the query results.

1) MapReduce-based Spatial Query Operators: In VegaGiStore, we have implemented several spatial query operators using the map/reduce paradigm. The spatial query operators are classified into three categories: spatial selection, spatial join and NN (Nearest Neighbor). The spatial selection queries comprise point queries, range queries and region queries, where region queries include rectangle, circle and polygon queries; the NN queries consist of k-NN (k-Nearest Neighbor) queries. In addition, the spatial query algorithms are encapsulated into spatial query operators, and these operators are packaged as a map/reduce spatial query library. Therefore, an arbitrarily complex spatial query can be implemented by a combination of these query operators.

2) Parallel Execution of Spatial Queries: Our scheme takes advantage of data-parallel processing techniques so that it can provide both inter-query parallelism and intra-query parallelism. The inter-query parallelism is obtained by executing multiple spatial queries in parallel as independent jobs, so that a large number of concurrent clients can be supported. The intra-query parallelism is obtained by parallel execution of two independent phases within an individual spatial query. As shown in Fig. 7, a spatial query is processed in two phases: the filter phase and the refinement phase. The filter phase searches the global index and obtains the candidate SOFile sets, and these candidates are processed in parallel by a map-reduce job at the refinement phase.

The details of spatial query execution in VegaGiStore are described as follows.

Firstly, the filter operation prunes non-qualified spatial objects by searching the global index, and returns the candidate SOFile sets. Since the global index is kept in memory and retrieved by the GC (Global Code) of the global quadtree, the filter phase finishes within several milliseconds. The outputs of this phase are the GC values of the SOFiles that match the query requirements, and the candidate SOFile sets are used as the input of the subsequent refinement phase for further computation (a hedged sketch of this filter step follows).
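A minimal sketch of the filter step is given below, assuming the global quadtree is held in memory as simple Node objects carrying their region MBRs; Node and MBR are illustrative stand-ins for VegaGiStore's actual in-memory structures.

import java.util.ArrayList;
import java.util.List;

public class GlobalIndexFilter {
    public static class MBR {
        final double minX, minY, maxX, maxY;
        MBR(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
        boolean intersects(MBR o) {
            return minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY;
        }
    }

    public static class Node {
        final long gc;           // quaternary Global Code of the region
        final MBR mbr;           // kept with the <GC, MBR> pairs described in Section II-B1
        final Node[] children;   // null for a leaf, i.e., a region backed by one ".sof" file
        Node(long gc, MBR mbr, Node[] children) { this.gc = gc; this.mbr = mbr; this.children = children; }
    }

    /** Collect the GCs of candidate SOFiles; they become the input of the refinement job. */
    public static void filter(Node node, MBR query, List<Long> candidates) {
        if (node == null || !node.mbr.intersects(query)) {
            return;                              // prune this subtree: no overlap with the query window
        }
        if (node.children == null) {
            candidates.add(node.gc);             // leaf: one candidate data block
            return;
        }
        for (Node child : node.children) {
            filter(child, query, candidates);
        }
    }

    public static void main(String[] args) {
        Node leaf = new Node(300, new MBR(0, 0, 1, 1), null);
        Node root = new Node(0, new MBR(0, 0, 4, 4), new Node[]{leaf, null, null, null});
        List<Long> out = new ArrayList<>();
        filter(root, new MBR(0.5, 0.5, 2, 2), out);
        System.out.println(out);                 // [300]
    }
}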
Secondly, the candidate SOFile sets are interpreted into <ID, object> pairs and processed by a map-reduce job at the refinement phase. Since the map-reduce framework relies on the InputSplit and RecordReader interfaces, we implement SOFileInputSplit and SOFileRecordReader to generate <ID, WKBobject> pairs for the Mapper. The map and reduce procedures are described as follows (a hedged code sketch follows the list).
- Map task. The generated <ID, WKBobject> pairs are transferred to the SpatialQueryMapper and processed in parallel by the TaskTrackers on the cluster nodes. This step obtains the <ID, WKBobject> pairs that satisfy the query conditions.
- Reduce task. The satisfying <ID, WKBobject> pairs are transferred to the Reducer. The SpatialQueryReducer executes the complex spatial relationship computation to produce the final query results.
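The skeleton below is a hedged sketch of the refinement phase as a Hadoop (new-API) map-reduce job; the SpatialSelectionJob class, the input format producing <ID, WKBobject> pairs and the two geometry predicates are assumptions standing in for VegaGiStore's actual classes and a real WKB library such as JTS.

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SpatialSelectionJob {

    /** Map: keep only the <ID, WKBobject> pairs whose geometry interacts with the query region. */
    public static class SpatialQueryMapper
            extends Mapper<Text, BytesWritable, Text, BytesWritable> {
        @Override
        protected void map(Text id, BytesWritable wkb, Context context)
                throws IOException, InterruptedException {
            if (interactsWithQueryRegion(wkb)) {       // coarse geometric filter on the candidate
                context.write(id, wkb);
            }
        }
    }

    /** Reduce: run the exact spatial-relationship computation on the surviving candidates. */
    public static class SpatialQueryReducer
            extends Reducer<Text, BytesWritable, Text, Text> {
        @Override
        protected void reduce(Text id, Iterable<BytesWritable> candidates, Context context)
                throws IOException, InterruptedException {
            for (BytesWritable wkb : candidates) {
                if (exactSpatialPredicate(wkb)) {      // e.g., Intersects / Within on the full geometry
                    context.write(id, new Text("MATCH"));
                }
            }
        }
    }

    // Stand-ins for a real WKB geometry library; declared only to keep the sketch self-contained.
    static boolean interactsWithQueryRegion(BytesWritable wkb) { return true; }
    static boolean exactSpatialPredicate(BytesWritable wkb)   { return true; }
}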
Figure 7. Spatial query processing architecture of VegaGiStore. The filter phase searches the global index and outputs candidate SOFile sets; then these candidate sets are processed in parallel by a map-reduce job at the refinement phase.

Since a complex spatial query can be combined from several spatial query operators, and these operators are map-reduce based, the complex spatial query can be executed in parallel on many nodes. Besides, a large number of concurrent spatial queries can be executed simultaneously. Therefore, VegaGiStore achieves high throughput for spatial query processing over big spatial data.

III. PERFORMANCE EVALUATION

A. Experiment Environment

Our experiments are conducted on a cluster of 17 commodity servers spread across two racks (RACK1 and RACK2). RACK1 consists of 8 nodes; each node has two quad-core Intel 2.13GHz CPUs, 4GB DDR3 RAM, and a 15000r/min 300GB SAS hard disk. RACK2 consists of 9 nodes; each node has an Intel Pentium 4 2.8GHz CPU, 2GB DDR2 RAM, and a 7200r/min 80GB SATA hard disk. All nodes are connected through Gigabit Ethernet switches.

Software configurations are detailed as follows. All nodes run identical CentOS 5.5 server edition (kernel 2.6.18), Linux Ext3 and JDK-1.6.0_20. A PostgreSQL-9.0.5 cluster, bare Hadoop-0.20.2, Cassandra-0.7.6, HBase-0.20.6 and VegaGiStore are deployed on the cluster. Moreover, ZooKeeper-3.3.3 is deployed on 7 nodes to maintain configuration information and provide distributed synchronization. Besides, we also deploy two spatial databases in RACK1, i.e., the commercial Oracle Spatial + Oracle database cluster and the open-sourced PostGIS + PostgreSQL cluster.
B. Test Items and Datasets

As already mentioned, spatial queries have to process large amounts of spatial data, and spatial query efficiency depends heavily on both I/O and spatial computation performance; hence we evaluate spatial query performance in terms of two categories of metrics, I/O metrics and spatial query metrics. We evaluate the I/O performance with three frequently-used I/O operations in spatial applications: random reads, sequential reads and bulk loading. The spatial query efficiency is evaluated with conventional operations, including spatial selection queries, spatial joins and k-NN queries.

The real spatial dataset is about 1.379TB and consists of raster and vector datasets, covering eight map scales with a highest resolution of 1:5000. The raster dataset contains about 128,323,657 file-based tiles, and each tile ranges from several bytes to tens of KBs. The vector dataset consists of geometry objects: (a) TLP contains 314,851,774 point objects; (b) TLL contains 81,991,436 line objects; (c) HYP contains 16,749,181 polygon objects.

C. Reads Operations

We evaluate two read operations, random reads and sequential reads, which are used in different application scenarios. The random reads operation is often used for random access to spatial objects within a small region, e.g., reading the spatial object at a given <longitude, latitude> location; the sequential reads operation is used to sequentially access adjacent spatial objects within a map layer, e.g., reading all geometry objects within a specific map layer.

Let R(lon, lat) denote reading (lon × lat) spatial objects within region R; e.g., R(1, 1) means reading one object, and R(80, 80) means reading 6400 spatial objects. We conduct six groups of comparative experiments for random reads and sequential reads, respectively. The comparison covers VegaGiStore and four other typical systems: a PostgreSQL cluster, bare HDFS, Cassandra and HBase.

1) Random Reads Operation: The random reads performance is evaluated by reading spatial objects with sizes from R(1, 1) to R(8, 8).

Figure 8. Random reads performance.

As shown in Fig. 8, the average random reads performance of VegaGiStore is about 79%, 338%, 96% and 89% better than that of the PostgreSQL cluster, bare HDFS, Cassandra and HBase, respectively.

Since bare HDFS is only tuned for streaming large files, it performs worst when randomly reading small spatial objects. The PostgreSQL cluster performs better than the key-value stores because it has a spatial index. Moreover, VegaGiStore performs even better when randomly reading more spatial objects, e.g., VegaGiStore takes 1.01ms and 20.86ms to read 1 object and 64 objects, whereas the respective times are 1.12ms and 38.45ms for the PostgreSQL cluster. VegaGiStore gains excellent random reads performance due to its geography-aware data organization scheme, and hence it can provide low-latency random access for spatial applications involving a large number of concurrent reads.

2) Sequential Reads Operation: We also conduct six groups of tests with sizes from R(20, 20) to R(80, 80) for the sequential reads evaluation; each test case is repeated 10 times, and the average results are collected.

Figure 9. Sequential reads performance.

As shown in Fig. 9, the average sequential reads performance of VegaGiStore is about 198%, 856%, 336% and 309% better than that of the PostgreSQL cluster, bare HDFS, Cassandra and HBase, respectively. Moreover, VegaGiStore performs better when reading more geographically adjacent spatial objects, e.g., it costs only 112ms to read 6400 spatial objects from VegaGiStore, whereas the respective times are 523ms, 1187ms, 593ms and 583ms for the PostgreSQL cluster, bare HDFS, Cassandra and HBase.

VegaGiStore outperforms the compared systems in the read micro-benchmarks because it benefits from the geography-aware data organization scheme. VegaGiStore organizes geographically adjacent spatial objects into sequential disk pages, and hence the objects are streamed to clients successively once it seeks to the right position. Moreover, VegaGiStore can support a large number of concurrent reads across multiple nodes because it preserves geographic proximity
and storage locality. Due to the ignorance of geographic proximity and the absence of a spatial index in HDFS, Cassandra and HBase, these systems may access too many data blocks across multiple nodes while reading geographically adjacent objects, which leads to low sequential read efficiency.

D. Bulk Loading Operation

Since most spatial applications follow a write-once read-many access model [14], large amounts of spatial data should be imported quickly into the storage system for rapid deployment of spatial information services. The bulk loading operation is often used for batch import of spatial data in practical spatial applications, e.g., loading multi-scale spatial data across multiple map layers into the storage system.

We imported three groups of datasets into VegaGiStore and the compared systems, namely Linux Ext3 (LocalFS), the PostgreSQL cluster, bare HDFS, Cassandra and HBase. Two replicas are kept in all systems, and the HDFS block size is set to 64MB. The three dataset groups include raster data and vector data, and they are classified as small (64GB), medium (512GB) and large (1024GB) groups.

Figure 10. Bulk loading performance.

As shown in Fig. 10, the bulk loading time of the compared systems varies with the dataset size, and VegaGiStore outperforms the other systems in all test cases.

Since there are lots of small tiles and geometry objects, LocalFS and bare HDFS do not perform as well as the other four systems. The bulk loading performance of VegaGiStore gets even better when storing larger datasets. For the small group, the bulk loading time of VegaGiStore is about 17.6 minutes, which is 680%, 510%, 597%, 99% and 235% faster than that of LocalFS, the PostgreSQL cluster, bare HDFS, Cassandra and HBase, respectively. On the other hand, it takes about 261.9 minutes to load the large (1024GB) dataset into VegaGiStore, which is about 10.9, 5.13, 6.88, 1.1 and 1.36 times faster than the compared systems, respectively. Besides, the average I/O throughput of VegaGiStore is about 65.8MB/s, whereas the I/O throughputs of LocalFS, the PostgreSQL cluster, HDFS, Cassandra and HBase are about 6.9, 11.3, 8.9, 32.9 and 27.3 MB/s, respectively. Therefore, VegaGiStore achieves the highest I/O throughput and has obvious advantages for bulk loading of big spatial data.

E. Spatial Query Performance

Since key-value stores do not provide spatial query functions, we compare the spatial query performance of VegaGiStore with two typical spatial databases, i.e., PostgreSQL+PostGIS and Oracle Spatial. The datasets are imported into the three compared systems, and the spatial indices of the spatial objects are created as well. Moreover, we show the scalability of VegaGiStore on different numbers of nodes, i.e., VegaGiStore is evaluated on clusters of 1, 2, 3, 5, 7, 9, 11, 13, 15 and 17 nodes, respectively. Besides, each node runs two map tasks and one reduce task in VegaGiStore when executing map-reduce based spatial query jobs.

1) Spatial Selection Performance: We conducted three groups of experiments (RQ1, RQ2 and RQ3) to evaluate the spatial selection performance. First, we create a rectangular region R whose size is 46.53% of the MBR of the HYP dataset; then spatial selection operations are executed in the compared systems to find all the objects of the vector datasets that geometrically interact with R; finally the outputs, i.e., the information of the satisfying geometry objects, are computed and printed.

Figure 11. RQ1 finds point objects of TLP within R.

The spatial selection operation RQ1 queries all point objects of dataset TLP that lie within region R. As shown in Fig. 11, when processing RQ1 on 2 to 17 nodes, the execution time of VegaGiStore is reduced from 159.09s to 12.71s, whereas the execution times of PostGIS and Oracle Spatial are 168.72s-76.32s and 152.21s-69.93s, respectively. The average speedup ratio of VegaGiStore is about 75.32%. Moreover, it should be pointed out that the execution time of VegaGiStore is longer than that of the SDBs on a single node. That is because VegaGiStore depends on the MapReduce runtime system, and the MapReduce startup is a costly process.

The spatial selection operation RQ2 queries all line objects of dataset TLL that lie within or intersect with region R. As shown in Fig. 12, the average speedup ratio of VegaGiStore is about 72.87%, and the execution time is
reduced from 308.67s to 27.06s, whereas the execution times of PostGIS and Oracle Spatial are 353.78s-236.67s and 343.61s-226.39s as the number of nodes increases from 2 to 17, respectively.

Figure 12. RQ2 finds line objects of TLL that interact with R.

The spatial selection operation RQ3 queries all polygon objects of dataset HYP that interact with or overlap region R. As shown in Fig. 13, as RQ3 is processed on 2 to 17 nodes, the execution time of VegaGiStore is reduced from 752.89s to 55.37s, whereas the execution times of PostGIS and Oracle Spatial decrease less markedly, i.e., 812.37s-608.91s and 762.37s-508.91s, respectively. Besides, the average speedup ratio of VegaGiStore is about 75.89%. Therefore, VegaGiStore achieves distinguished spatial selection performance and has good scalability.

Figure 13. RQ3 finds polygon objects of HYP that interact with R.

2) Spatial Join Performance: A spatial join query combines objects from two datasets by geometric attributes that satisfy a spatial predicate. We conduct an experiment to evaluate the spatial join query, where the spatial predicate is intersection. The intersection join query is processed over dataset TLL (line objects), and it answers queries such as finding roads that cross rivers in a specific area.

We select two spatial datasets S1 and S2, each 30% of the size of TLL. The spatial join performance is evaluated by the intersection join operation, i.e., finding the objects that satisfy the predicate {(r, s) | r Intersect s, r ∈ S1, s ∈ S2}.

Figure 14. Spatial join query evaluation.

As shown in Fig. 14, the spatial join query performance of VegaGiStore is much better than that of PostGIS and Oracle Spatial. The execution times of VegaGiStore, PostGIS and Oracle Spatial on one node are 1458.39s, 1423.76s and 1396.58s, respectively. However, the execution time of VegaGiStore is reduced markedly as the cluster scales, e.g., the time is 91.37s with 17 nodes, whereas the respective times are 588.69s and 538.69s for PostGIS and Oracle Spatial. The average speedup ratio of VegaGiStore is about 70.98% when processing the intersection spatial join query. VegaGiStore performs better with more nodes, so it can efficiently process spatial join queries involving large datasets.

3) kNN Performance: The kNN query predicate finds the k objects in the TLP dataset that are closest to a query point p.

Figure 15. kNN spatial query performance of different systems (k = 1, 10).

We evaluate the kNN spatial query on VegaGiStore and the spatial databases (i.e., PostGIS and Oracle Spatial) with k = 1 and 10. As shown in Fig. 15, VegaGiStore outperforms the spatial databases when running on more than two nodes, and its execution time is reduced from 620.98s to 58.17s as the number of nodes increases from 2 to 17, whereas the respective times for PostGIS and Oracle Spatial are 859.28s-298.67s and 883.79s-263.79s. Moreover, as shown in Fig. 16, the kNN performance of the spatial databases decreases rapidly
with larger k, whereas VegaGiStore stays at a relatively stable level. Besides, the kNN performance of VegaGiStore increases with more nodes, and its average speedup ratio reaches about 73.85% when k ranges from 1 to 50. Therefore, VegaGiStore can provide efficient kNN spatial queries for data-intensive spatial applications.

Figure 16. kNN query performance of VegaGiStore with different k values and numbers of nodes.

IV. RELATED WORK

There are quite a few early works on spatial query processing that integrate spatial indexes into SDB. They focus on pruning the search space while processing queries in Euclidean space [15], e.g., the Quadtree [9], the R-tree and their variants [16] are integrated into Oracle Spatial [17] and PostGIS [18]. SDB performs well with small spatial datasets [1]. However, limited by the fixed schema and the strict ACID (Atomicity, Consistency, Isolation, Durability) semantics, SDB cannot provide efficient spatial queries involving big spatial data.

LDD (Location Dependent Database) is a typical spatially-tagged database used for location-related data management. LDD supports location context-aware information applications in mobile environments [19]. However, LDD only answers simple location-related attribute queries over small textual datasets within a local area.

Key-value store systems are emerging with web-scale data, and they are suitable for managing semi-structured data that can be represented by the key-value model. Google's Bigtable is used to store the satellite imagery at many different levels of resolution for the Google Earth product [6]. Open-sourced key-value stores such as HBase [7] and Cassandra [8] are widely used in web applications for storing textual data or images. However, they cannot support efficient spatial queries due to the ignorance of geographic proximity and the absence of a spatial index.

There are also works that improve spatial query processing by revising traditional spatial indexes in distributed environments. [20] and [21] propose solutions to improve spatial queries in peer-to-peer environments; the parallel R-tree [22] is designed for shared-disk environments. However, the spatial index only improves data retrieval efficiency, and these approaches disregard I/O throughput and spatial computation capability. Thus, they cannot achieve high performance spatial query processing involving massive spatial data and concurrent users.

Query parallelism is a significant issue of query processing. Typical parallel databases [23] provide inter-query and intra-query parallelism for parallel processing of structured data. We focus on parallel query processing of multi-dimensional spatial data; with the provision of geographic proximity, a spatial index and spatial query parallelism, our proposal can achieve high aggregate I/O throughput and spatial computation capability.

V. CONCLUSION

We have proposed and implemented a distributed, efficient and scalable scheme (i.e., VegaGiStore) to provide multifunctional spatial queries over big spatial data. Firstly, a geography-aware data organization approach is presented to achieve high aggregate I/O throughput. The big spatial data are partitioned into blocks according to their geographic space and a block size threshold, and adjacent spatial objects are stored sequentially into SOFiles in terms of geographic proximity. Secondly, in order to improve data retrieval efficiency, we design a two-tier distributed spatial index for efficient pruning of the search space. The index consists of a quadtree-based global index and a Hilbert-ordering local index, and hence it improves query efficiency with low-latency access. Thirdly, we propose an "indexing + MapReduce" data processing architecture to improve the spatial query computation capability of VegaGiStore. This architecture takes advantage of data-parallel processing techniques to provide both intra-query parallelism and inter-query parallelism, and thus can reduce individual spatial query execution time and serve a large number of concurrent spatial queries. We have compared VegaGiStore with traditional spatial databases (i.e., PostGIS and Oracle Spatial) and emerging distributed key-value stores (i.e., Cassandra and HBase). The experimental results show that VegaGiStore achieves the best spatial query processing performance, and thus can meet the high performance requirements of data-intensive spatial applications.

ACKNOWLEDGMENT

This work is supported by the National High Technology Research and Development Program (863 Program) of China (Grant No. 2011AA120302 and No. 2011AA120300). The work is also funded by the CAS Special Grant for Postgraduate Research, Innovation and Practice. We would like to thank the anonymous reviewers for their valuable comments.

REFERENCES

[1] C. Yang, D. Wong, Q. Miao, and R. Yang, Advanced Geoinformation Science, 1st ed. CRC Press, October 2009.
[2] R. H. Güting, "An introduction to spatial database systems," The VLDB Journal, vol. 3, pp. 357–399, October 1994.
[3] M. Egenhofer, "Spatial SQL: a query and presentation language," IEEE Transactions on Knowledge and Data Engineering, vol. 6, no. 1, pp. 86–95, February 1994.
[4] S. Shekhar and S. Chawla, Spatial Databases: A Tour, 1st ed. Prentice Hall, June 2003.
[5] Z. Shubin, H. Jizhong, L. Zhiyong, W. Kai, and X. Zhiyong, "SJMR: Parallelizing spatial join with MapReduce on clusters," in IEEE International Conference on Cluster Computing, 2009, pp. 1–8.
[6] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Trans. Comput. Syst., vol. 26, pp. 4:1–4:26, June 2008.
[7] "HBase." [Online]. Available: http://hbase.apache.org
[8] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Operating Systems Review, vol. 44, pp. 35–40, April 2010.
[9] H. Samet, "The quadtree and related hierarchical data structures," ACM Comput. Surv., vol. 16, pp. 187–260, June 1984.
[10] "Hadoop." [Online]. Available: http://hadoop.apache.org
[11] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), ser. MSST '10. IEEE Computer Society, 2010, pp. 1–10.
[12] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107–113, January 2008.
[13] H. Samet, "The quadtree and related hierarchical data structures," ACM Comput. Surv., vol. 16, pp. 187–260, June 1984.
[14] X. Liu, J. Han, Y. Zhong, and C. Han, "Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS," in IEEE International Conference on Cluster Computing, 2009, pp. 1–8.
[15] V. Gaede and O. Günther, "Multidimensional access methods," ACM Comput. Surv., vol. 30, pp. 170–231, June 1998.
[16] S. Brakatsoulas, D. Pfoser, and Y. Theodoridis, "Revisiting R-tree construction principles," in Advances in Databases and Information Systems, ser. Lecture Notes in Computer Science, Y. Manolopoulos and P. Návrat, Eds. Springer Berlin / Heidelberg, 2002, vol. 2435, pp. 17–24.
[17] R. K. V. Kothuri, S. Ravada, and D. Abugov, "Quadtree and R-tree indexes in Oracle Spatial: a comparison using GIS data," in Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '02. New York, NY, USA: ACM, 2002, pp. 546–557.
[18] "PostGIS." [Online]. Available: http://postgis.refractions.net/
[19] D. L. Lee, J. Xu, B. Zheng, and W.-C. Lee, "Data management in location-dependent information services," IEEE Pervasive Computing, vol. 1, no. 3, pp. 65–72, 2002.
[20] B. Liu, W.-C. Lee, and D. L. Lee, "Supporting complex multi-dimensional queries in P2P systems," in Proceedings of the 25th IEEE International Conference on Distributed Computing Systems, ser. ICDCS '05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 155–164.
[21] E. Tanin, A. Harwood, and H. Samet, "Using a distributed quadtree index in peer-to-peer networks," The VLDB Journal, vol. 16, pp. 165–178, April 2007.
[22] I. Kamel and C. Faloutsos, "Parallel R-trees," SIGMOD Rec., vol. 21, pp. 195–204, June 1992.
[23] D. DeWitt and J. Gray, "Parallel database systems: the future of high performance database systems," Commun. ACM, vol. 35, pp. 85–98, June 1992.