2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
Towards Parallel Spatial Query Processing for Big Spatial Data
Yunqin Zhong 1,2,* , Jizhong Han 1 , Tieying Zhang 1,2 , Zhenhua Li 3 , Jinyun Fang 1 , Guihai Chen 4
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 Graduate University of Chinese Academy of Sciences, Beijing, China
3 Peking University, Beijing, China
4 Shanghai Jiaotong University, Shanghai, China
* Corresponding author, e-mail: [email protected]
Abstract—In recent years, spatial applications have become more and more important in both scientific research and industry. Spatial query processing is the fundamental component supporting spatial applications. However, the state-of-the-art techniques of spatial query processing are facing significant challenges as the data expand and user accesses increase. In this paper we propose and implement a novel scheme (named VegaGiStore) to provide efficient spatial query processing over big spatial data and numerous concurrent user queries. Firstly, a geography-aware approach is proposed to organize spatial data in terms of geographic proximity, and this approach can achieve high aggregate I/O throughput. Secondly, in order to improve data retrieval efficiency, we design a two-tier distributed spatial index for efficient pruning of the search space. Thirdly, we propose an "indexing + MapReduce" data processing architecture to improve the computation capability of spatial queries. Performance evaluations of the real-deployed VegaGiStore system confirm its effectiveness.

Keywords-spatial data management; distributed storage; spatial index; spatial query; spatial applications;

I. INTRODUCTION

In recent years, spatial applications such as Web-based Geographical Information Systems (WebGIS) and Location-Based Social Networking Services (LBSNS) have become more and more important in both scientific research and industry. Spatial query processing is the fundamental component supporting spatial applications. However, the state-of-the-art techniques of spatial query processing are facing significant challenges as the data expand and user accesses increase [1]. With the development of earth observation technologies, spatial data are growing exponentially year by year (currently at a petabyte scale), and their categories are becoming more diverse, including multi-dimensional geographic data, multi-spectrum remote sensing imagery, high-resolution aerial photographs, and so on. Besides, as spatial applications become more popular, concurrent user accesses to them are becoming highly intensive.

Spatial data objects are generally nested and more complex than basic data types (e.g., strings). They are stored as multi-dimensional geometry objects, e.g., points, lines and polygons. Moreover, spatial query predicates are complex: typical spatial queries are based not only on the values of alphanumeric attributes but also on the spatial location, extent and measurements of spatial objects in a reference system. Therefore, spatial query processing over big spatial data requires intensive disk I/O accesses and spatial computation.

The state-of-the-art techniques of spatial query processing mainly include SDB (spatial databases) [2] and KVS (key-value stores). SDB provides a spatial query language (i.e., spatial SQL) [3] and performs well when handling relatively small spatial datasets of megabytes or gigabytes [4]. However, since spatial queries are usually both I/O intensive and computing intensive, e.g., a single query may take minutes or even hours in SDB [5], the I/O and computation capabilities of SDB can hardly meet the high performance requirements of spatial queries over big spatial data. The emerging KVS systems, such as Bigtable [6], HBase [7] and Cassandra [8], have proved to be feasible alternatives for storing big semi-structured data thanks to their scalability. They have been adopted in some I/O intensive applications, e.g., Bigtable has been used to store satellite imagery for Google Earth [6]. However, the data in key-value stores are organized regardless of geographic proximity, and they are indexed by key-based structures (e.g., B+ trees) rather than spatial indexes. Therefore, KVS cannot process spatial queries efficiently.

Driven by the above problems, in this paper we propose and implement a novel scheme (named VegaGiStore) to provide efficient spatial query processing over big spatial data and numerous concurrent user queries. Firstly and most importantly, we propose a geography-aware data organization approach to achieve high aggregate I/O throughput. The big spatial data are partitioned into blocks according to the geographic space and a block size threshold (the threshold is the maximum size of a block; the partitioning process does not finish until the total size of the spatial objects within each partitioned region is smaller than the threshold), and these blocks are uniformly distributed over the cluster nodes. Then the geographically adjacent spatial objects are stored sequentially in terms of a space filling curve, which preserves the geographic proximity of spatial objects. In practical spatial applications, most clients only focus on a relatively small area and query for adjacent spatial objects within that area. Thereby concurrent clients can be served in parallel by
different cluster nodes, and adjacent spatial objects can be streamed to clients sequentially without random I/O seeks.

Secondly, in order to improve data retrieval efficiency, we design a two-tier distributed spatial index for efficient pruning of the search space. The index consists of a Quadtree-based [9] global index and a Hilbert-ordering local index, where the global index is used to find data blocks and the local index is used to locate spatial objects.

Thirdly, we propose an "indexing + MapReduce" data processing architecture to improve the spatial query computation capability. This architecture takes advantage of data-parallel processing techniques to provide both intra-query parallelism and inter-query parallelism, and thus can reduce individual spatial query execution time and support a large number of concurrent spatial queries.

We have implemented VegaGiStore on top of Hadoop [10], an emerging open-sourced cloud platform. VegaGiStore can support numerous concurrent spatial queries for various spatial applications like Web Mapping Services (WMS), Web Feature Services (WFS) and Web Coverage Services (WCS) [1]. Compared with the traditional methods, VegaGiStore improves the average speed-up ratio by 70.98%-75.89% when processing spatial queries on a 17-node cluster, and its average spatial query performance is about 10.3-13.5 times better than that of single-node spatial databases. Moreover, its average I/O throughput is 99%-235% higher than that of the compared key-value stores. In summary, our contributions in this paper can be summarized as follows:

1) We present a feasible scheme for efficient processing of spatial queries over big spatial data. We tackle the problem through three significant approaches: a geography-aware organization approach for high I/O throughput; a two-tier distributed spatial index for data retrieval efficiency; and an "indexing + MapReduce" spatial querying architecture for parallel processing. Our scheme can be easily integrated into a cloud computing platform (e.g., Hadoop [10]) to support parallel spatial query processing.
2) We have implemented a spatial data management system termed VegaGiStore on top of HDFS (Hadoop Distributed File System) [11] and the MapReduce framework [12]. VegaGiStore provides multifunctional spatial queries which most key-value store systems do not have, and it is transparent to spatial applications. Besides, the system evaluations show that VegaGiStore outperforms spatial databases and emerging key-value stores while processing concurrent spatial queries from numerous clients in practical spatial applications.

The rest of the paper is organized as follows. Section II details the parallel spatial query processing scheme. Section III presents the performance evaluation. Section IV reviews the related work. Finally, Section V concludes this paper.

II. PARALLEL SPATIAL QUERY PROCESSING SCHEME

A. Geography-aware Spatial Data Organization Approach

1) Spatial Data Partitioning: We propose a geography-aware quadripartition method to partition a large map layer. The scheme is designed to guarantee that data within a partitioned region are stored on one node, and that all spatial data are distributed across the cluster according to geographical space. Spatial data objects are logically or physically organized in multi-scale map layers. A spatial object has three attributes: ID (identifier), MBR (Minimum Bounding Rectangle) and object value. A map layer also has three attributes: unique name, MBR and resolution. The MBR is an expression of the maximum extent of a 2-dimensional spatial object; it is frequently used as an indication of the general position of a spatial object, and it serves as spatial metadata for first-approximation spatial queries and for spatial indexing. Therefore, spatial applications can access spatial data within different regions from different nodes to provide spatial information services for numerous users.

Input: Region (i.e., MBR of a map layer)
Output: Partitioned subregions
1: Initiate(region)
2: M_SIZE <- 64 MB
3: {0, 1, 2, 3} <- {NW, NE, SE, SW}
4: Boolean isValid <- Verify(region)
5: if isValid then
6:   for i = 0 to 3 do
7:     subregion[i] <- Partition(region)
8:   end for
9: else
10:   exit(0)
11: end if
12: for i = 0 to 3 do
13:   Verify_Partition(subregion[i])
14: end for
Figure 1. Verify_Partition(region): procedure for partitioning a region.

The partitioning process is described as follows, and the respective pseudocode is shown in Fig. 1 (a hedged code sketch of the recursive procedure follows the list).
1) Input and initialization. Input a map layer/region and compute the size of its objects, including the real data size and the additional index size.
2) Verification. If the region size is larger than the threshold M_SIZE, set the isValid flag to "TRUE" and go to Step 3; otherwise the region size is not larger than M_SIZE, so set isValid to "FALSE" to indicate that the region need not be partitioned.
3) Partition. Partition the region into four quadrants according to its MBR; each quadrant represents one subregion.
4) Compute the sizes of the four subregions respectively, and go to Step 2 to verify each subregion recursively and determine whether it should be further partitioned.
5) The partitioning process is executed recursively until no subregion is larger than M_SIZE.
6) If all partitioned regions satisfy the validity requirements, return "0"; otherwise terminate the partition procedure.
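For concreteness, the following is a minimal Java sketch of the recursive quadripartition of Fig. 1; the Region class, its size accounting and the quarter-size approximation are illustrative assumptions rather than VegaGiStore's actual implementation.

import java.util.ArrayList;
import java.util.List;

public class QuadPartitioner {
    static final long M_SIZE = 64L * 1024 * 1024;        // 64 MB block size threshold

    /** A square region described by its MBR and the total size (data + index) of its objects. */
    public static class Region {
        final double minX, minY, maxX, maxY;
        final long sizeInBytes;                           // assumed to be precomputed from the dataset
        Region(double minX, double minY, double maxX, double maxY, long sizeInBytes) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
            this.sizeInBytes = sizeInBytes;
        }
    }

    /** Recursively split a region into NW/NE/SE/SW quadrants until every
     *  subregion holds at most M_SIZE bytes of spatial objects (cf. Fig. 1). */
    public static void verifyPartition(Region region, List<Region> leaves) {
        if (region.sizeInBytes <= M_SIZE) {               // small enough: becomes one SOFile/HDFS block
            leaves.add(region);
            return;
        }
        for (Region quadrant : partition(region)) {
            verifyPartition(quadrant, leaves);
        }
    }

    /** Split a region into four quadrants; per-quadrant sizes are crudely
     *  approximated as a quarter each, only to keep the sketch self-contained. */
    static Region[] partition(Region r) {
        double midX = (r.minX + r.maxX) / 2, midY = (r.minY + r.maxY) / 2;
        long quarter = r.sizeInBytes / 4;
        return new Region[] {
            new Region(r.minX, midY, midX, r.maxY, quarter),   // NW
            new Region(midX, midY, r.maxX, r.maxY, quarter),   // NE
            new Region(midX, r.minY, r.maxX, midY, quarter),   // SE
            new Region(r.minX, r.minY, midX, midY, quarter)    // SW
        };
    }

    public static void main(String[] args) {
        List<Region> leaves = new ArrayList<>();
        verifyPartition(new Region(0, 0, 1, 1, 512L * 1024 * 1024), leaves);
        System.out.println(leaves.size() + " leaf regions");  // 16 for a uniform 512 MB layer
    }
}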
According to the principle of geographic proximity, spatial objects within a region are combined into one data block, so the threshold size M_SIZE should be set as large as the HDFS block size in order to guarantee that the spatial data within a region are stored on the same node; it is typically set to 64MB and can be varied according to the dataset volume and the cluster scale. Otherwise, the spatial data within a region may be stored on more than one node, which reduces data retrieval performance.

According to the partition procedure, three deductions are described as follows.
- Let κ denote the size of a square region; a region of size κ contains 2^κ × 2^κ spatial objects.
- The upper-left point is defined as the first object of the region.
- The first κ bits of the coordinates (x, y) of the first object are "0", i.e., x = x_n···x_κ 00···0 and y = y_n···y_κ 00···0, where n denotes the size of the parent region. The higher (n − κ) bits of the coordinates of the objects within the region are identical; they are defined as the region code, which is represented as (y_n x_n)(y_{n−1} x_{n−1})···(y_κ x_κ).

Fig. 2 shows an example of partitioning a region with the quadripartition scheme. The region size is κ = 4, its subregions are represented by solid-line squares, and it contains 2^4 × 2^4 = 256 spatial objects, represented by dotted squares.

Figure 2. Quadripartition.    Figure 3. Hilbert-order storage.

2) SOFile: We design a spatial object placement structure termed SOFile (Spatial Object File). An SOFile is created during the partitioning process, and the spatial objects within a subregion are stored in an SOFile named by the subregion's GC value. The raster data are stored as tile objects in the SOFile, whereas the vector data are stored as WKB (Well-Known Binary) objects. Taking geographic proximity into consideration, geographically adjacent objects should be stored in sequential disk pages. Spatial objects within a partitioned subregion are stored into the SOFile along a space filling curve, and they are organized in Hilbert order instead of row-wise order or Z order because the Hilbert curve has a better locality-preserving property [1].

Each SOFile consists of geographically adjacent spatial objects within a specific subregion, and one SOFile occupies one data block. Since there are two conventional spatial data models in spatial applications, we design two different SOFile structures, for raster tiles and for vector geometry objects, respectively. The structure of the SOFile for the raster data model, called SOFileRaster, is shown in Fig. 4; Fig. 5 shows the structure of the SOFile for the vector data model, termed SOFileVector.

Figure 4. Structure of SOFile for the raster data model. SOFileRaster is designed for raster data placement and contains a local index header and raster objects.

Figure 5. Structure of SOFile for the vector data model. SOFileVector is designed for vector geometry object placement and contains a local index header and WKB objects.

Both SOFileRaster and SOFileVector inherit from the SOFile structure, which includes a local index header and a real data part. Since raster data and vector data serve different functions in spatial queries, we design a different index structure for each of the two spatial data models; the local index header is the main distinction between SOFileRaster and SOFileVector, and it is described in Section II-B2. The local index header contains the metadata of the block and the index items of the spatial objects; the data content part contains the real data of the spatial objects within the region. The size of an SOFile is the total length of the index part and the real data part. The spatial objects are organized in Hilbert order and assigned unique HCs (Hilbert Codes), and adjacent spatial objects are stored on sequential disk pages, which guarantees geographic proximity and storage locality. Fig. 3 shows an example in which the spatial objects within region R31 are stored in Hilbert order.
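To illustrate the Hilbert ordering used inside an SOFile, the sketch below converts grid coordinates to a Hilbert Code with the standard bit-interleaving algorithm on the 2^κ × 2^κ grid of a region; this is the textbook formulation, not necessarily the authors' exact implementation.

public final class HilbertCode {
    /** Convert grid coordinates (x, y) on a 2^kappa x 2^kappa grid to a Hilbert Code (HC). */
    public static long xy2d(int kappa, long x, long y) {
        long n = 1L << kappa;
        long d = 0;
        for (long s = n / 2; s > 0; s /= 2) {
            long rx = (x & s) > 0 ? 1 : 0;
            long ry = (y & s) > 0 ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            if (ry == 0) {                    // rotate/flip the quadrant to keep the curve continuous
                if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
                long t = x; x = y; y = t;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // Sorting objects by xy2d(kappa, x, y) places geographically adjacent objects
        // on sequential disk pages, which is the storage order of an SOFile.
        System.out.println(xy2d(2, 1, 1));    // prints 2 on the 4 x 4 grid of a kappa = 2 region
    }
}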
The leaf nodes of the global index tree point to data block files with the suffix ".sof" (spatial object file) on HDFS, and a non-leaf node represents a region that has been partitioned into four smaller subregions because its size is larger than the threshold M_SIZE. Fig. 6 shows the hierarchical directory structure of a region (κ = 4) stored on HDFS, corresponding to the quadripartition example shown in Fig. 2. An ellipse represents a storage directory corresponding to a non-leaf node, and a rectangle represents a data block file corresponding to a leaf node. HDFS creates one block for each file, and file blocks are distributed across the cluster nodes for load balancing. As shown in Fig. 6, the directory hierarchy is a quadtree-like structure: the root node of the quadtree represents the root directory identified by the "Global Code", and its four children represent subdirectories and ".sof" files.

Figure 6. Hierarchical structure for spatial data on HDFS (κ = 4).

B. Two-tier Distributed Spatial Index

The VegaGiStore system must be able to retrieve, from a large collection of objects in some space, those lying within a particular area without scanning the whole dataset, so a spatial index is mandatory. In order to improve spatial data access performance and optimize spatial queries, we propose a scalable distributed spatial index to accelerate locating spatial objects on HDFS. Considering geographic proximity and storage locality, geographically adjacent data should be stored on the same node.

Our proposed distributed spatial index is a two-tier scalable index consisting of a global index and a local index, with two salient features. The global index is based on the revised distributed quadtree index [13] and is used to determine the data block location. The local index is built along a space filling curve and is used to locate spatial objects within a block. Moreover, the distributed index is designed and tuned for spatial applications, and it is oriented towards improving spatial data retrieval efficiency on HDFS.

1) Global Index: The global quadtree index is created during the quadripartition process. A large map layer is partitioned into four quadrants recursively until all subregions satisfy the threshold. Meanwhile, all spatial objects belonging to the map layer are partitioned according to their geographical space, and adjacent objects are sequentially stored into an SOFile. Once a large map layer is split into several subregions, the spatial data are partitioned into many data blocks and uniformly spread across the HDFS DataNodes.

The global index is quadtree-based, and the global tree structure is represented by the Global Code (GC). The GC is a quaternary code, where GC = c_1 c_2 ··· c_s = y_1 x_1 y_2 x_2 ··· y_s x_s. The GC value can be computed by (1), where s and κ denote the size of a region and of its subregions, (x, y) denotes the coordinates of objects, and c_i ∈ {0, 1, 2, 3}:

GC = Σ_{κ=1}^{s} (2y_κ + x_κ) × 4^{s−κ}    (1)

According to (1), each region has a unique GC value, which is used to construct the global index. As shown in Fig. 2, the quaternary numerals denote the GC values of regions, e.g., region R300 = 303 and R301 = 301, from which we can derive that the GC value of their parent node is 30.
Since a non-leaf node of the global index tree is pointed to only by its GC value, the global tree is very small, and the global index is resident in memory during the retrieval process. Besides, the <GC, MBR> pairs of the regions are maintained in a HashMap structure and are used to obtain MBR information for further spatial query computation.

2) Local Index: The local index is created when the subregion data are written into an SOFile, and the index data are stored in the SOFile as well. Therefore, the leaf nodes of the global quadtree point to the headers of spatial object files. The local index is used to index the spatial objects within an SOFile, and the local index header is organized as follows (a hedged lookup sketch follows the list).
- Metadata information. For the SOFileRaster structure, the first word is reserved for the data version; the second and third words are the (x, y) coordinates of the first object; the fourth word is the κ value of the region; the region is determined by its κ value and the coordinates (x, y) of the first tile object when processing raster data. For the SOFileVector structure, the first four words are the MBR (Minimum Bounding Rectangle) information of the region, represented by four double values; the fifth word is the HC (Hilbert Code) value of the first WKB object; the sixth and seventh words are the GC value and the κ value of the region, respectively.
- Index item. An index item contains two fields: offset and length. That is, the local index entry of each spatial object corresponds to an <offset, length> pair, and the index items of the spatial objects are written into the block sequentially.
- Index length. There are 2^κ × 2^κ objects and the index length of an object is 8 bytes, so the total length of the file indices is 2^{2κ+3} bytes. Thus the index length of SOFileRaster and SOFileVector is (2^{2κ+3} + 12) bytes and (2^{2κ+3} + 24) bytes, respectively.
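The lookup below is a hedged sketch of how one object can be fetched from an SOFileVector through its <offset, length> index item; the 24-byte metadata size follows the (2^{2κ+3} + 24)-byte formula above, while the 4-byte field widths and the file path are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SOFileReader {
    static final int VECTOR_META_BYTES = 24;   // metadata words of SOFileVector (assumed layout)
    static final int INDEX_ITEM_BYTES  = 8;    // one <offset, length> pair per object

    /** Read the i-th object of an SOFileVector, i being its position in Hilbert order. */
    public static byte[] readObject(FileSystem fs, Path sofile, int i) throws IOException {
        FSDataInputStream in = fs.open(sofile);
        try {
            in.seek(VECTOR_META_BYTES + (long) i * INDEX_ITEM_BYTES);
            int offset = in.readInt();         // where the WKB object starts in the data part
            int length = in.readInt();         // how many bytes it occupies
            byte[] wkb = new byte[length];
            in.readFully(offset, wkb);         // positioned read of the object itself
            return wkb;
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] obj = readObject(fs, new Path("/vegagistore/layer/30/300.sof"), 42);  // hypothetical path
        System.out.println(obj.length + " bytes");
    }
}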
C. "Indexing + MapReduce" Data Processing Architecture

We propose an "indexing + MapReduce" data processing architecture to improve the spatial query computation capability of VegaGiStore. This architecture takes advantage of data-parallel processing techniques to provide both intra-query parallelism and inter-query parallelism, and thereby can reduce individual spatial query execution time and serve a large number of concurrent spatial queries. Our scheme is specific to spatial queries, including spatial selection, spatial join and nearest neighbors, and the spatial queries are processed in multiple phases. The first, filter phase prunes non-qualified objects with the spatial index to obtain candidate intermediate sets, and the qualified candidate objects are then transferred as the input of the refinement phase. Finally, the spatial relation computation examines the actual object representations to determine the query results.

1) MapReduce-based Spatial Query Operators: In VegaGiStore, we have implemented several spatial query operators using the map/reduce paradigm. The spatial query operators are classified into three categories: spatial selection, spatial join and NN (Nearest Neighbor). The spatial selection queries comprise point queries, range queries and region queries, where region queries include rectangle, circle and polygon queries; the NN queries consist of k-NN (k-Nearest Neighbor) queries. In addition, the spatial query algorithms are encapsulated into spatial query operators, and these operators are packaged as a map/reduce spatial query library. Therefore, an arbitrarily complex spatial query can be implemented by a combination of these query operators.

2) Parallel Execution of Spatial Queries: Our scheme takes advantage of data-parallel processing techniques so that it can provide both inter-query parallelism and intra-query parallelism. The inter-query parallelism is obtained by executing multiple spatial queries in parallel as independent jobs, so that a large number of concurrent clients can be supported. The intra-query parallelism is obtained by parallel execution of two independent phases within an individual spatial query. As shown in Fig. 7, a spatial query is processed in two phases: the filter phase and the refinement phase. The filter phase searches the global index and obtains the candidate SOFile sets, and these candidates are processed in parallel by a map-reduce job at the refinement phase.

The details of spatial query execution in VegaGiStore are described as follows.

Firstly, the filter operation prunes non-qualified spatial objects by searching the global index, and returns the candidate SOFile sets. Since the global index is kept in memory and retrieved by the GC (Global Code) of the global quadtree, the filter phase finishes within several milliseconds. The outputs of this phase are the GC values of the SOFiles that match the query requirements, and the candidate SOFile sets are used as the input of the subsequent refinement phase for further computation (a hedged sketch of this filter step follows).
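A minimal sketch of the filter step is given below, assuming the global quadtree is held in memory as simple Node objects carrying their region MBRs; Node and MBR are illustrative stand-ins for VegaGiStore's actual in-memory structures.

import java.util.ArrayList;
import java.util.List;

public class GlobalIndexFilter {
    public static class MBR {
        final double minX, minY, maxX, maxY;
        MBR(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
        boolean intersects(MBR o) {
            return minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY;
        }
    }

    public static class Node {
        final long gc;           // quaternary Global Code of the region
        final MBR mbr;           // kept with the <GC, MBR> pairs described in Section II-B1
        final Node[] children;   // null for a leaf, i.e., a region backed by one ".sof" file
        Node(long gc, MBR mbr, Node[] children) { this.gc = gc; this.mbr = mbr; this.children = children; }
    }

    /** Collect the GCs of candidate SOFiles; they become the input of the refinement job. */
    public static void filter(Node node, MBR query, List<Long> candidates) {
        if (node == null || !node.mbr.intersects(query)) {
            return;                              // prune this subtree: no overlap with the query window
        }
        if (node.children == null) {
            candidates.add(node.gc);             // leaf: one candidate data block
            return;
        }
        for (Node child : node.children) {
            filter(child, query, candidates);
        }
    }

    public static void main(String[] args) {
        Node leaf = new Node(300, new MBR(0, 0, 1, 1), null);
        Node root = new Node(0, new MBR(0, 0, 4, 4), new Node[]{leaf, null, null, null});
        List<Long> out = new ArrayList<>();
        filter(root, new MBR(0.5, 0.5, 2, 2), out);
        System.out.println(out);                 // [300]
    }
}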
Secondly, the candidate SOFile sets are interpreted into <ID, object> pairs and processed by a map-reduce job at the refinement phase. Since the map-reduce framework relies on the InputSplit and RecordReader interfaces, we implement SOFileInputSplit and SOFileRecordReader to generate <ID, WKBobject> pairs for the Mapper. The map and reduce procedures are described as follows (a hedged code sketch follows the list).
- Map task. The generated <ID, WKBobject> pairs are transferred to the SpatialQueryMapper and processed in parallel by the TaskTrackers on the cluster nodes. This step obtains the <ID, WKBobject> pairs that satisfy the query conditions.
- Reduce task. The satisfying <ID, WKBobject> pairs are transferred to the Reducer. The SpatialQueryReducer executes the complex spatial relationship computation to produce the final query results.
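The skeleton below is a hedged sketch of the refinement phase as a Hadoop (new-API) map-reduce job; the SpatialSelectionJob class, the input format producing <ID, WKBobject> pairs and the two geometry predicates are assumptions standing in for VegaGiStore's actual classes and a real WKB library such as JTS.

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SpatialSelectionJob {

    /** Map: keep only the <ID, WKBobject> pairs whose geometry interacts with the query region. */
    public static class SpatialQueryMapper
            extends Mapper<Text, BytesWritable, Text, BytesWritable> {
        @Override
        protected void map(Text id, BytesWritable wkb, Context context)
                throws IOException, InterruptedException {
            if (interactsWithQueryRegion(wkb)) {       // coarse geometric filter on the candidate
                context.write(id, wkb);
            }
        }
    }

    /** Reduce: run the exact spatial-relationship computation on the surviving candidates. */
    public static class SpatialQueryReducer
            extends Reducer<Text, BytesWritable, Text, Text> {
        @Override
        protected void reduce(Text id, Iterable<BytesWritable> candidates, Context context)
                throws IOException, InterruptedException {
            for (BytesWritable wkb : candidates) {
                if (exactSpatialPredicate(wkb)) {      // e.g., Intersects / Within on the full geometry
                    context.write(id, new Text("MATCH"));
                }
            }
        }
    }

    // Stand-ins for a real WKB geometry library; declared only to keep the sketch self-contained.
    static boolean interactsWithQueryRegion(BytesWritable wkb) { return true; }
    static boolean exactSpatialPredicate(BytesWritable wkb)   { return true; }
}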
Figure 7. Spatial query processing architecture of VegaGiStore. The filter phase searches the global index and outputs candidate SOFile sets; then these candidate sets are processed in parallel by a map-reduce job at the refinement phase.

Since a complex spatial query can be combined from several spatial query operators, and these operators are map-reduce based, the complex spatial query can be executed in parallel on many nodes. Besides, a large number of concurrent spatial queries can be executed simultaneously. Therefore, VegaGiStore achieves high throughput for spatial query processing over big spatial data.

III. PERFORMANCE EVALUATION

A. Experiment Environment

Our experiments are conducted on a cluster of 17 commodity servers spread across two racks (RACK1 and RACK2). RACK1 consists of 8 nodes; each node has two quad-core Intel 2.13GHz CPUs, 4GB DDR3 RAM, and a 15000r/min 300GB SAS hard disk. RACK2 consists of 9 nodes; each node has an Intel Pentium 4 2.8GHz CPU, 2GB DDR2 RAM, and a 7200r/min 80GB SATA hard disk. All nodes are connected through Gigabit Ethernet switches.

Software configurations are detailed as follows. All nodes run identical CentOS 5.5 server edition (kernel 2.6.18), Linux Ext3 and JDK-1.6.0_20. A PostgreSQL-9.0.5 cluster, bare Hadoop-0.20.2, Cassandra-0.7.6, HBase-0.20.6 and VegaGiStore are deployed on the cluster. Moreover, ZooKeeper-3.3.3 is deployed on 7 nodes to maintain configuration information and provide distributed synchronization. Besides, we also deploy two spatial databases in RACK1, i.e., the commercial Oracle Spatial + Oracle database cluster and the open-sourced PostGIS + PostgreSQL cluster.
B. Test Items and Datasets

As already mentioned, spatial queries have to process large amounts of spatial data, and spatial query efficiency depends heavily on both I/O and spatial computation performance; hence we evaluate spatial query performance in terms of two categories of metrics, I/O metrics and spatial query metrics. We evaluate the I/O performance with three frequently-used I/O operations in spatial applications: random reads, sequential reads and bulk loading. The spatial query efficiency is evaluated with conventional operations, including spatial selection queries, spatial joins and k-NN queries.

The real spatial dataset is about 1.379TB and consists of raster and vector datasets, covering eight map scales with a highest resolution of 1:5000. The raster dataset contains about 128,323,657 file-based tiles, and each tile ranges from several bytes to tens of KBs. The vector dataset consists of geometry objects: (a) TLP contains 314,851,774 point objects; (b) TLL contains 81,991,436 line objects; (c) HYP contains 16,749,181 polygon objects.

C. Reads Operations

We evaluate two read operations, random reads and sequential reads, which are used in different application scenarios. The random reads operation is often used for random access to spatial objects within a small region, e.g., reading the spatial object at a given <longitude, latitude> location; the sequential reads operation is used to sequentially access adjacent spatial objects within a map layer, e.g., reading all geometry objects within a specific map layer.

Let R(lon, lat) denote reading (lon × lat) spatial objects within region R; e.g., R(1, 1) means reading one object, and R(80, 80) means reading 6400 spatial objects. We conduct six groups of comparative experiments for random reads and sequential reads, respectively. The comparison covers VegaGiStore and four other typical systems: a PostgreSQL cluster, bare HDFS, Cassandra and HBase.

1) Random Reads Operation: The random reads performance is evaluated by reading spatial objects with sizes from R(1, 1) to R(8, 8).

Figure 8. Random reads performance.

As shown in Fig. 8, the average random reads performance of VegaGiStore is about 79%, 338%, 96% and 89% better than that of the PostgreSQL cluster, bare HDFS, Cassandra and HBase, respectively.

Since bare HDFS is only tuned for streaming large files, it performs worst when randomly reading small spatial objects. The PostgreSQL cluster performs better than the key-value stores because it has a spatial index. Moreover, VegaGiStore performs even better when randomly reading more spatial objects, e.g., VegaGiStore takes 1.01ms and 20.86ms to read 1 object and 64 objects, whereas the respective times are 1.12ms and 38.45ms for the PostgreSQL cluster. VegaGiStore gains excellent random reads performance due to its geography-aware data organization scheme, and hence it can provide low-latency random access for spatial applications involving a large number of concurrent reads.

2) Sequential Reads Operation: We also conduct six groups of tests with sizes from R(20, 20) to R(80, 80) for the sequential reads evaluation; each test case is repeated 10 times, and the average results are collected.

Figure 9. Sequential reads performance.

As shown in Fig. 9, the average sequential reads performance of VegaGiStore is about 198%, 856%, 336% and 309% better than that of the PostgreSQL cluster, bare HDFS, Cassandra and HBase, respectively. Moreover, VegaGiStore performs better when reading more geographically adjacent spatial objects, e.g., it costs only 112ms to read 6400 spatial objects from VegaGiStore, whereas the respective times are 523ms, 1187ms, 593ms and 583ms for the PostgreSQL cluster, bare HDFS, Cassandra and HBase.

VegaGiStore outperforms the compared systems in the read micro-benchmarks because it benefits from the geography-aware data organization scheme. VegaGiStore organizes geographically adjacent spatial objects into sequential disk pages, and hence the objects are streamed to clients successively once it seeks to the right position. Moreover, VegaGiStore can support a large number of concurrent reads across multiple nodes because it preserves geographic proximity
and storage locality. Due to the ignorance of geographic proximity and the absence of a spatial index in HDFS, Cassandra and HBase, these systems may access too many data blocks across multiple nodes while reading geographically adjacent objects, which leads to low sequential read efficiency.

D. Bulk Loading Operation

Since most spatial applications follow a write-once read-many access model [14], large amounts of spatial data should be imported quickly into the storage system for rapid deployment of spatial information services. The bulk loading operation is often used for batch import of spatial data in practical spatial applications, e.g., loading multi-scale spatial data across multiple map layers into the storage system.

We imported three groups of datasets into VegaGiStore and the compared systems, namely Linux Ext3 (LocalFS), the PostgreSQL cluster, bare HDFS, Cassandra and HBase. Two replicas are kept in all systems, and the HDFS block size is set to 64MB. The three dataset groups include raster data and vector data, and they are classified as small (64GB), medium (512GB) and large (1024GB) groups.

Figure 10. Bulk loading performance.

As shown in Fig. 10, the bulk loading time of the compared systems varies with the dataset size, and VegaGiStore outperforms the other systems in all test cases.

Since there are lots of small tiles and geometry objects, LocalFS and bare HDFS do not perform as well as the other four systems. The bulk loading performance of VegaGiStore gets even better when storing larger datasets. For the small group, the bulk loading time of VegaGiStore is about 17.6 minutes, which is 680%, 510%, 597%, 99% and 235% faster than that of LocalFS, the PostgreSQL cluster, bare HDFS, Cassandra and HBase, respectively. On the other hand, it takes about 261.9 minutes to load the large (1024GB) dataset into VegaGiStore, which is about 10.9, 5.13, 6.88, 1.1 and 1.36 times faster than the compared systems, respectively. Besides, the average I/O throughput of VegaGiStore is about 65.8MB/s, whereas the I/O throughputs of LocalFS, the PostgreSQL cluster, HDFS, Cassandra and HBase are about 6.9, 11.3, 8.9, 32.9 and 27.3 MB/s, respectively. Therefore, VegaGiStore achieves the highest I/O throughput and has obvious advantages for bulk loading of big spatial data.

E. Spatial Query Performance

Since key-value stores do not provide spatial query functions, we compare the spatial query performance of VegaGiStore with two typical spatial databases, i.e., PostgreSQL+PostGIS and Oracle Spatial. The datasets are imported into the three compared systems, and the spatial indices of the spatial objects are created as well. Moreover, we show the scalability of VegaGiStore on different numbers of nodes, i.e., VegaGiStore is evaluated on clusters of 1, 2, 3, 5, 7, 9, 11, 13, 15 and 17 nodes, respectively. Besides, each node runs two map tasks and one reduce task in VegaGiStore when executing map-reduce based spatial query jobs.

1) Spatial Selection Performance: We conducted three groups of experiments (RQ1, RQ2 and RQ3) to evaluate the spatial selection performance. First, we create a rectangular region R whose size is 46.53% of the MBR of the HYP dataset; then spatial selection operations are executed in the compared systems to find all the objects of the vector datasets that geometrically interact with R; finally the outputs, i.e., the information of the satisfying geometry objects, are computed and printed.

Figure 11. RQ1 finds point objects of TLP within R.

The spatial selection operation RQ1 queries all point objects of dataset TLP that lie within region R. As shown in Fig. 11, when processing RQ1 on 2 to 17 nodes, the execution time of VegaGiStore is reduced from 159.09s to 12.71s, whereas the execution times of PostGIS and Oracle Spatial are 168.72s-76.32s and 152.21s-69.93s, respectively. The average speedup ratio of VegaGiStore is about 75.32%. Moreover, it should be pointed out that the execution time of VegaGiStore is longer than that of the SDBs on a single node. That is because VegaGiStore depends on the MapReduce runtime system, and the MapReduce startup is a costly process.

The spatial selection operation RQ2 queries all line objects of dataset TLL that lie within or intersect with region R. As shown in Fig. 12, the average speedup ratio of VegaGiStore is about 72.87%, and the execution time is
reduced from 308.67s to 27.06s, whereas the execution times of PostGIS and Oracle Spatial are 353.78s-236.67s and 343.61s-226.39s as the number of nodes increases from 2 to 17, respectively.

Figure 12. RQ2 finds line objects of TLL that interact with R.

The spatial selection operation RQ3 queries all polygon objects of dataset HYP that interact with or overlap region R. As shown in Fig. 13, as RQ3 is processed on 2 to 17 nodes, the execution time of VegaGiStore is reduced from 752.89s to 55.37s, whereas the execution times of PostGIS and Oracle Spatial decrease less markedly, i.e., 812.37s-608.91s and 762.37s-508.91s, respectively. Besides, the average speedup ratio of VegaGiStore is about 75.89%. Therefore, VegaGiStore achieves distinguished spatial selection performance and has good scalability.

Figure 13. RQ3 finds polygon objects of HYP that interact with R.

2) Spatial Join Performance: A spatial join query combines objects from two datasets by geometric attributes that satisfy a spatial predicate. We conduct an experiment to evaluate the spatial join query, where the spatial predicate is intersection. The intersection join query is processed over dataset TLL (line objects), and it answers queries such as finding roads that cross rivers in a specific area.

We select two spatial datasets S1 and S2, each 30% of the size of TLL. The spatial join performance is evaluated by the intersection join operation, i.e., finding the objects that satisfy the predicate {(r, s) | r Intersect s, r ∈ S1, s ∈ S2}.

Figure 14. Spatial join query evaluation.

As shown in Fig. 14, the spatial join query performance of VegaGiStore is much better than that of PostGIS and Oracle Spatial. The execution times of VegaGiStore, PostGIS and Oracle Spatial on one node are 1458.39s, 1423.76s and 1396.58s, respectively. However, the execution time of VegaGiStore is reduced markedly as the cluster scales, e.g., the time is 91.37s with 17 nodes, whereas the respective times are 588.69s and 538.69s for PostGIS and Oracle Spatial. The average speedup ratio of VegaGiStore is about 70.98% when processing the intersection spatial join query. VegaGiStore performs better with more nodes, so it can efficiently process spatial join queries involving large datasets.

3) kNN Performance: The kNN query predicate finds the k objects in the TLP dataset that are closest to a query point p.

Figure 15. kNN spatial query performance of different systems (k = 1, 10).

We evaluate the kNN spatial query on VegaGiStore and the spatial databases (i.e., PostGIS and Oracle Spatial) with k = 1 and 10. As shown in Fig. 15, VegaGiStore outperforms the spatial databases when running on more than two nodes, and its execution time is reduced from 620.98s to 58.17s as the number of nodes increases from 2 to 17, whereas the respective times for PostGIS and Oracle Spatial are 859.28s-298.67s and 883.79s-263.79s. Moreover, as shown in Fig. 16, the kNN performance of the spatial databases decreases rapidly
with larger k, whereas VegaGiStore stays at a relatively stable level. Besides, the kNN performance of VegaGiStore increases with more nodes, and its average speedup ratio reaches about 73.85% when k ranges from 1 to 50. Therefore, VegaGiStore can provide efficient kNN spatial queries for data-intensive spatial applications.

Figure 16. kNN query performance of VegaGiStore with different k values and numbers of nodes.

IV. RELATED WORK

There are quite a few early works on spatial query processing that integrate spatial indexes into SDB. They focus on pruning the search space while processing queries in Euclidean space [15], e.g., the Quadtree [9], the R-tree and their variants [16] are integrated into Oracle Spatial [17] and PostGIS [18]. SDB performs well with small spatial datasets [1]. However, limited by the fixed schema and the strict ACID (Atomicity, Consistency, Isolation, Durability) semantics, SDB cannot provide efficient spatial queries involving big spatial data.

LDD (Location Dependent Database) is a typical spatially-tagged database used for location-related data management. LDD supports location context-aware information applications in mobile environments [19]. However, LDD only answers simple location-related attribute queries over small textual datasets within a local area.

Key-value store systems are emerging with web-scale data, and they are suitable for managing semi-structured data that can be represented by the key-value model. Google's Bigtable is used to store the satellite imagery at many different levels of resolution for the Google Earth product [6]. Open-sourced key-value stores such as HBase [7] and Cassandra [8] are widely used in web applications for storing textual data or images. However, they cannot support efficient spatial queries due to the ignorance of geographic proximity and the absence of a spatial index.

There are also works that improve spatial query processing by revising traditional spatial indexes in distributed environments. [20] and [21] propose solutions to improve spatial queries in peer-to-peer environments; the parallel R-tree [22] is designed for shared-disk environments. However, the spatial index only improves data retrieval efficiency, and these approaches disregard I/O throughput and spatial computation capability. Thus, they cannot achieve high performance spatial query processing involving massive spatial data and concurrent users.

Query parallelism is a significant issue of query processing. Typical parallel databases [23] provide inter-query and intra-query parallelism for parallel processing of structured data. We focus on parallel query processing of multi-dimensional spatial data; with the provision of geographic proximity, a spatial index and spatial query parallelism, our proposal can achieve high aggregate I/O throughput and spatial computation capability.

V. CONCLUSION

We have proposed and implemented a distributed, efficient and scalable scheme (i.e., VegaGiStore) to provide multifunctional spatial queries over big spatial data. Firstly, a geography-aware data organization approach is presented to achieve high aggregate I/O throughput. The big spatial data are partitioned into blocks according to their geographic space and a block size threshold, and adjacent spatial objects are stored sequentially into SOFiles in terms of geographic proximity. Secondly, in order to improve data retrieval efficiency, we design a two-tier distributed spatial index for efficient pruning of the search space. The index consists of a quadtree-based global index and a Hilbert-ordering local index, and hence it improves query efficiency with low-latency access. Thirdly, we propose an "indexing + MapReduce" data processing architecture to improve the spatial query computation capability of VegaGiStore. This architecture takes advantage of data-parallel processing techniques to provide both intra-query parallelism and inter-query parallelism, and thus can reduce individual spatial query execution time and serve a large number of concurrent spatial queries. We have compared VegaGiStore with traditional spatial databases (i.e., PostGIS and Oracle Spatial) and emerging distributed key-value stores (i.e., Cassandra and HBase). The experimental results show that VegaGiStore achieves the best spatial query processing performance, and thus can meet the high performance requirements of data-intensive spatial applications.

ACKNOWLEDGMENT

This work is supported by the National High Technology Research and Development Program (863 Program) of China (Grant No. 2011AA120302 and No. 2011AA120300). The work is also funded by the CAS Special Grant for Postgraduate Research, Innovation and Practice. We would like to thank the anonymous reviewers for their valuable comments.

REFERENCES

[1] C. Yang, D. Wong, Q. Miao, and R. Yang, Advanced Geoinformation Science, 1st ed. CRC Press, October 2009.
[2] R. H. Güting, "An introduction to spatial database systems," The VLDB Journal, vol. 3, pp. 357–399, October 1994.
[3] M. Egenhofer, "Spatial SQL: a query and presentation language," IEEE Transactions on Knowledge and Data Engineering, vol. 6, no. 1, pp. 86–95, February 1994.
[4] S. Shekhar and S. Chawla, Spatial Databases: A Tour, 1st ed. Prentice Hall, June 2003.
[5] Z. Shubin, H. Jizhong, L. Zhiyong, W. Kai, and X. Zhiyong, "SJMR: Parallelizing spatial join with MapReduce on clusters," in IEEE International Conference on Cluster Computing, 2009, pp. 1–8.
[6] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Trans. Comput. Syst., vol. 26, pp. 4:1–4:26, June 2008.
[7] "HBase." [Online]. Available: http://hbase.apache.org
[8] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Operating Systems Review, vol. 44, pp. 35–40, April 2010.
[9] H. Samet, "The quadtree and related hierarchical data structures," ACM Comput. Surv., vol. 16, pp. 187–260, June 1984.
[10] "Hadoop." [Online]. Available: http://hadoop.apache.org
[11] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), ser. MSST '10. IEEE Computer Society, 2010, pp. 1–10.
[12] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107–113, January 2008.
[13] H. Samet, "The quadtree and related hierarchical data structures," ACM Comput. Surv., vol. 16, pp. 187–260, June 1984.
[14] X. Liu, J. Han, Y. Zhong, and C. Han, "Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS," in IEEE International Conference on Cluster Computing, 2009, pp. 1–8.
[15] V. Gaede and O. Günther, "Multidimensional access methods," ACM Comput. Surv., vol. 30, pp. 170–231, June 1998.
[16] S. Brakatsoulas, D. Pfoser, and Y. Theodoridis, "Revisiting R-tree construction principles," in Advances in Databases and Information Systems, ser. Lecture Notes in Computer Science, Y. Manolopoulos and P. Návrat, Eds. Springer Berlin / Heidelberg, 2002, vol. 2435, pp. 17–24.
[17] R. K. V. Kothuri, S. Ravada, and D. Abugov, "Quadtree and R-tree indexes in Oracle Spatial: a comparison using GIS data," in Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '02. New York, NY, USA: ACM, 2002, pp. 546–557.
[18] "PostGIS." [Online]. Available: http://postgis.refractions.net/
[19] D. L. Lee, J. Xu, B. Zheng, and W.-C. Lee, "Data management in location-dependent information services," IEEE Pervasive Computing, vol. 1, no. 3, pp. 65–72, 2002.
[20] B. Liu, W.-C. Lee, and D. L. Lee, "Supporting complex multi-dimensional queries in P2P systems," in Proceedings of the 25th IEEE International Conference on Distributed Computing Systems, ser. ICDCS '05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 155–164.
[21] E. Tanin, A. Harwood, and H. Samet, "Using a distributed quadtree index in peer-to-peer networks," The VLDB Journal, vol. 16, pp. 165–178, April 2007.
[22] I. Kamel and C. Faloutsos, "Parallel R-trees," SIGMOD Rec., vol. 21, pp. 195–204, June 1992.
[23] D. DeWitt and J. Gray, "Parallel database systems: the future of high performance database systems," Commun. ACM, vol. 35, pp. 85–98, June 1992.