
Efficient Spatial Query Processing for Big Data

https://doi.org/10.1145/2666310.2666481

Abstract

Spatial queries are widely used in many data mining and analytics applications. However, the huge and growing size of spatial data makes it challenging to process spatial queries efficiently. In this paper, we present a lightweight and scalable spatial index for big data stored in distributed storage systems. Experimental results show the efficiency and effectiveness of our spatial indexing technique for different spatial queries.

Kisung Lee†, Raghu K. Ganti¶, Mudhakar Srivatsa¶, Ling Liu†
†College of Computing, Georgia Institute of Technology, Atlanta, GA, USA
¶IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
[email protected], {rganti, msrivats}@us.ibm.com, [email protected]

1. INTRODUCTION

Many real-world and online activities are associated with spatial information. For example, when we make or receive a call, the call information, including its cell tower location, is stored as a call detail record (CDR). Even a single tweet can be stored with its detailed location (i.e., latitude and longitude) [1]. To extract more valuable and meaningful information from such spatial data, spatial queries are widely used in many data mining and analytics applications. One of the most pressing challenges in processing spatial queries is that the amount of spatial data is growing at an unprecedented rate, especially thanks to the widespread use of GPS-enabled smartphones. Due to this huge volume of spatial data, we need new scalable techniques that can process spatial queries efficiently.

To handle such huge spatial data, it is natural to utilize emerging distributed computing technologies such as Hadoop MapReduce, the Hadoop Distributed File System (HDFS) and HBase. Several techniques have been proposed to support spatial queries on Hadoop MapReduce [7, 11, 4, 12] or HDFS [5, 6]. However, most of them require internal modification of the underlying systems or frameworks to implement their indexing techniques based on, for example, R-trees. Those approaches not only increase the complexity and overhead of the modified storage systems but are also applicable only to a specific storage system.

To tackle the limitations of existing work, in this paper we investigate the problem of developing efficient and scalable techniques for processing spatial queries over big spatial data. Specifically, we present a lightweight spatial index based on a hierarchical spatial data structure. Our spatial index has several advantages. First, it can be easily applied to existing storage systems without modifying their internal implementation, so existing systems can be used as they are. Second, it provides simple yet highly efficient filtering, based on prefix matching, for finding only relevant spatial objects. Last but not least, it supports efficient updates of spatial objects because it does not maintain any costly data structure such as trees. In this paper, we demonstrate how we implement the spatial index on top of HBase without modifying its internal implementation. We also provide experimental results to show the efficiency and effectiveness of our spatial indexing techniques.
2. PRELIMINARY

In this section, we give an overview of spatial queries, the hierarchical spatial data structure we use, and distributed storage systems. We also outline the related work.

2.1 Spatial Queries

There are many types of spatial queries, such as selection queries, join queries and k nearest neighbor (kNN) queries, for different applications. Even though there are more spatial relations [8], in this paper we focus on selected fundamental queries that are the basis for many other spatial queries: containing, containedIn, intersects and withinDistance. These queries are defined for any geometries, including points, lines, rectangles and polygons. A containing(search geometry) query returns all spatial objects that contain the given search geometry. A containedIn(search geometry) query returns all spatial objects that are contained by the given search geometry (i.e., the converse of containing). An intersects(search geometry) query returns all spatial objects that intersect with the given search geometry. A withinDistance(search geometry, distance) query (or range query) returns all spatial objects that are within the given distance from the given search geometry.

2.2 Hierarchical Spatial Data Structure

For our spatial indexing, we utilize a hierarchical spatial data structure, called geohash [2], which is a geocoding system for latitude and longitude. A geohash code, represented as a string, denotes a rectangle (bounding box) on the earth. It provides a spatial hierarchy, and it can reduce the precision (i.e., represent a bigger rectangle) by removing characters from the end of the string. In other words, the longer the geohash code is, the smaller the bounding box represented by the code. Another property of geohash is that two places with a long common geohash prefix are close to each other. Similarly, nearby places usually share a similar prefix. However, it is not guaranteed that two close places always share a long common prefix.
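To make the prefix and hierarchy properties concrete, the following minimal geohash encoder (standard base-32 alphabet, longitude-first bit interleaving) can be used to reproduce them. This is an illustrative sketch written for this text, not the authors' implementation, and the coordinates in main() are arbitrary example values.

public final class GeohashExample {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encodes (lat, lon) into a geohash of the given length using the standard algorithm:
    // bits alternately halve the longitude and latitude ranges, 5 bits per character.
    static String encode(double lat, double lon, int length) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder hash = new StringBuilder();
        boolean lonBit = true;  // a geohash starts with a longitude bit
        int bits = 0, ch = 0;
        while (hash.length() < length) {
            if (lonBit) {
                double mid = (minLon + maxLon) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; minLon = mid; } else { ch <<= 1; maxLon = mid; }
            } else {
                double mid = (minLat + maxLat) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; minLat = mid; } else { ch <<= 1; maxLat = mid; }
            }
            lonBit = !lonBit;
            if (++bits == 5) {  // every 5 bits form one base-32 character
                hash.append(BASE32.charAt(ch));
                bits = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // Nearby points usually share a long common prefix, and truncating a code
        // yields the code of the enclosing (coarser) rectangle.
        String a = encode(33.7765, -84.3980, 8);
        String b = encode(33.7770, -84.3985, 8);
        System.out.println(a + " / " + b + " / coarser cell: " + a.substring(0, 4));
    }
}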
2.3 Distributed Storage Systems

A growing number of non-relational distributed databases (often called NoSQL databases) have been proposed and are widely used in many big data applications and analytics because they are designed to run on large clusters of commodity hardware and to be fault-tolerant through data replication. One representative category of NoSQL databases is the key-value store, in which data is stored in a schema-less way via a unique key that represents each row; examples include Apache HBase, Apache Accumulo, Apache Cassandra, Google BigTable and Amazon DynamoDB, just to name a few. In this paper, our description is based on HBase, an open-source key-value store (or wide column store) originally derived from BigTable, because it is widely used by many big data applications. However, we believe that our spatial index is similarly applicable to other key-value stores because we use only keys for our index, without modifying the internal structure of HBase.

2.4 Related Work

We classify existing spatial query processing techniques using distributed computing frameworks into two categories, based on their query types. The first category handles high selectivity queries, such as selection queries and kNN queries, in which only a small portion of the spatial objects is returned as the result of query processing. A few techniques have been proposed to process high selectivity queries in HDFS [5, 6]. They utilize popular spatial indices such as the R-tree and its variants. The second category handles low selectivity queries, which usually require at least one full scan of each dataset. One of the most representative low selectivity spatial queries is the k nearest neighbor join (kNN join), which finds, for each object in a dataset A, its k nearest neighbors in another dataset B. Several techniques have been proposed to process kNN (or similar) joins using the MapReduce framework [7, 11, 4, 12].

3. SPATIAL QUERY PROCESSING

A spatial object includes its geometry and can carry additional information about the object, such as its name, address and phone number. In terms of the geometry, our spatial index supports most generally used geometries, including points, lines, rectangles, curves and polygons. Given a spatial object to be stored and indexed by our spatial index, we first calculate a set of minimum bounding boxes (i.e., geohash codes), called the minimum geohash set, which fully covers the geometry of the spatial object. To prevent generating too many fine-grained bounding boxes to cover the geometry, and thus increasing the overhead of managing the spatial object, we set the maximum number of bounding boxes for each geometry to 10 in the first prototype of our spatial index. The maximum number of bounding boxes per geometry can be configured for different applications. Also, all the geohash codes included in a minimum geohash set have the same length and thus represent the same precision.

Similar to other indexing techniques such as R-trees, query processing based on our spatial index consists of two main steps: a filter step and a refinement step. Given a spatial query Q, in the filter step we find candidate spatial objects, which may satisfy the query condition of Q, by pruning non-qualifying spatial objects. In the refinement step, we examine each candidate spatial object to determine whether it actually satisfies the query condition of Q. We define the precision of query processing for Q as the ratio of the spatial objects actually satisfying the query condition of Q to all evaluated candidate spatial objects.

To develop our spatial index on top of HBase, we propose to utilize HBase row keys to indicate the geohash codes of stored spatial objects. Specifically, given a spatial object SO to be stored and indexed by our spatial index, for each geohash code in its minimum geohash set minGeohash(SO), we store the spatial object in the HBase row having that geohash code as its row key. We use a uniquely assigned identifier for the object as its column name (qualifier). We allow replication of spatial objects across multiple HBase rows for efficient processing of spatial queries, as we will explain below. For example, if the minimum geohash set of a spatial object is {"dn5bpsby", "dn5bpsbv"}, we store the spatial object in two HBase rows whose keys are "dn5bpsby" and "dn5bpsbv". Note that our replication of spatial objects is not related to the data block replication of the underlying HDFS for fault-tolerance.

According to the definition of the geohash, longer geohash codes are generated for smaller geometries. If there are many spatial objects associated with a tiny geometry, a huge number of HBase rows with long row keys may be created to store the objects, and each row will likely include only a few spatial objects. Since too many HBase rows can degrade the performance of our spatial query processing, we need to control the number of HBase rows. To limit the number of HBase rows, we utilize the hierarchical feature of geohash codes. By setting the maximum length of geohash codes (i.e., the length of HBase row keys), we can store those spatial objects associated with a tiny geometry in HBase rows representing a bigger rectangle and thus reduce the number of HBase rows.
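As a rough sketch of this write path, the fragment below stores one object under every code in its minimum geohash set. It assumes a recent HBase client API (the paper used HBase 0.96, whose Put API differs slightly), a hypothetical column family "o", and that the caller has already computed the minimum geohash set with codes capped at the configured maximum length; none of these details are prescribed by the paper.

import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SpatialWriter {
    private static final byte[] CF = Bytes.toBytes("o");  // assumed single column family

    private final Table table;

    public SpatialWriter(Table table) {
        this.table = table;
    }

    // Stores the object once per geohash code in its minimum geohash set, i.e., the
    // object may be deliberately replicated into several rows (Section 3).
    public void store(String objectId, byte[] serializedGeometry, Set<String> minGeohashSet)
            throws IOException {
        for (String geohash : minGeohashSet) {
            Put put = new Put(Bytes.toBytes(geohash));                      // row key = geohash code
            put.addColumn(CF, Bytes.toBytes(objectId), serializedGeometry); // qualifier = object id
            table.put(put);
        }
    }
}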
To execute spatial queries over the stored and indexed spatial objects in HBase, we utilize the properties of geohash codes to find only the relevant HBase rows and thus reduce the search space considerably. Let us assume that a spatial query Q with search geometry QG is given. We first calculate the minimum geohash set of Q, which fully covers QG.

If the query is containing(search geometry), we select only those HBase rows whose row key is a prefix of one of the geohash codes in the minimum geohash set. This is because spatial objects that contain the search geometry must have the same or larger rectangles than the search geometry. As explained above, a geohash code representing a rectangle is a prefix of the geohash codes representing the sub-rectangles of that rectangle. Therefore, using prefix matching, we can efficiently select candidate HBase rows that may store spatial objects containing the search geometry. Specifically, to find candidate HBase rows, we scan all possible prefixes of each geohash code in the minimum geohash set. For example, for a geohash code "dn5b" included in the minimum geohash set, we scan for the keys "d", "dn", "dn5" and "dn5b". Finally, for each candidate HBase row, we read all spatial objects stored in the row and return those spatial objects that actually contain the search geometry.

If the query is containedIn(search geometry), an intuitive approach is to select only those HBase rows whose row key includes one of the geohash codes in the minimum geohash set as its prefix, because containedIn is the converse of containing. However, we need to take into account that we set a maximum length of geohash codes to prevent generating too many small HBase rows. For example, let us assume that the minimum geohash set of a spatial object is {"dn5bpsby"} and the spatial object is stored in an HBase row whose row key is "dn5bp" because the maximum length of geohash codes is 5. Also, assume that a containedIn(search geometry) query in which the minimum geohash set of the search geometry is {"dn5bpsb"} is given and the search geometry actually contains the spatial object. Using the intuitive approach, we cannot select the HBase row "dn5bp" because "dn5bp" does not include "dn5bpsb" as its prefix. To tackle this problem, we also apply the maximum length to the geohash codes included in the minimum geohash set of the spatial query (from "dn5bpsb" to "dn5bp" in the previous example) and then use the intuitive approach. When we select candidate HBase rows whose row key includes one of the geohash codes in the minimum geohash set of the spatial query as its prefix, we utilize an HBase range scan for each geohash code. Specifically, for each geohash code in the minimum geohash set, we execute a range scan whose start row is the geohash code and whose end row is the lexicographically next geohash code of the same length, in order to access all HBase rows whose row key has the geohash code as its prefix. For example, for a geohash code "dn5b", we execute a range scan from "dn5b" to "dn5c". For each selected HBase row, we read the stored spatial objects in the row and return those spatial objects that are actually contained in the search geometry.
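The two key-selection primitives described above are simple string manipulations; the sketch below is an illustrative reimplementation (assuming the standard geohash base-32 alphabet), not code from the paper. prefixRowKeys enumerates the candidate row keys for a containing query, and prefixStopKey returns the exclusive end row of the range scan used for a containedIn query (for "dn5b" this is "dn5c"; if the last character is the maximal "z", the key is shortened while carrying, which still bounds the prefix range correctly).

import java.util.ArrayList;
import java.util.List;

public final class CandidateRows {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // containing: every prefix of the geohash code is a candidate row key
    // (for "dn5b": "d", "dn", "dn5", "dn5b").
    static List<String> prefixRowKeys(String geohash) {
        List<String> keys = new ArrayList<>();
        for (int i = 1; i <= geohash.length(); i++) {
            keys.add(geohash.substring(0, i));
        }
        return keys;
    }

    // containedIn: exclusive stop key of a range scan starting at the geohash code, so the
    // scan covers exactly the row keys that have the code as a prefix (e.g., "dn5b" -> "dn5c").
    static String prefixStopKey(String geohash) {
        char[] chars = geohash.toCharArray();
        for (int i = chars.length - 1; i >= 0; i--) {
            int pos = BASE32.indexOf(chars[i]);
            if (pos < BASE32.length() - 1) {
                chars[i] = BASE32.charAt(pos + 1);   // bump this character ...
                return new String(chars, 0, i + 1);  // ... and drop everything after it
            }
            // a maximal character ("z") rolls over; carry into the previous position
        }
        return null;  // the code was all "z": scan to the end of the table
    }

    public static void main(String[] args) {
        System.out.println(prefixRowKeys("dn5b"));  // [d, dn, dn5, dn5b]
        System.out.println(prefixStopKey("dn5b"));  // dn5c
    }
}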
If the query is intersects(search geometry), we consider both prefix cases when we select candidate HBase rows. This is because, if there is any intersecting region between the search geometry and the geometry of a spatial object, both geometries must have a rectangle (i.e., geohash code) that includes the intersecting region, and any two different rectangles including the same region must be in a hierarchy (i.e., one is a sub-rectangle of the other) according to the definition of geohash codes. Since we do not know which geometry has the bigger rectangle covering the intersecting region until we evaluate the spatial object, we select as candidate rows those HBase rows whose row key is a prefix of one of the geohash codes in the minimum geohash set of the spatial query or includes one of the geohash codes as its prefix. For each selected HBase row, we read the stored spatial objects in the row and return those spatial objects that actually intersect with the search geometry.

For a withinDistance(search geometry, distance) query, we first calculate the minimum geohash set that covers the extended geometry computed by adding the distance to the search geometry. Then, similar to the intersects query processing, we select as candidate rows those HBase rows whose row key is a prefix of one of the geohash codes in the minimum geohash set or includes one of the geohash codes as its prefix. For each selected HBase row, we read the stored spatial objects in the row and return those spatial objects that are actually within the given distance from the search geometry.
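Putting the pieces together, the filter step for intersects (and, after extending the search geometry by the query distance, for withinDistance) can be sketched as below. It reuses the CandidateRows helpers above, assumes a recent HBase client Scan API, and leaves the refinement step (testing the actual geometries) to the caller; as before, this is an illustration of the described procedure rather than the authors' code.

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IntersectsFilterStep {

    // Filter step for intersects/withinDistance: candidate rows are those whose key is a
    // prefix of a code in the minimum geohash set (shorter keys) or has such a code as its
    // prefix (longer keys, fetched with a range scan).
    static List<Result> candidateRows(Table table, Set<String> minGeohashSet) throws IOException {
        List<Result> candidates = new ArrayList<>();
        Set<String> visited = new LinkedHashSet<>();
        for (String code : minGeohashSet) {
            // Case 1: row keys that are proper prefixes of the code ("d", "dn", "dn5", ...).
            List<String> prefixes = CandidateRows.prefixRowKeys(code);
            prefixes.remove(code);  // the full-length code is covered by the range scan below
            for (String prefix : prefixes) {
                if (visited.add(prefix)) {
                    Result r = table.get(new Get(Bytes.toBytes(prefix)));
                    if (!r.isEmpty()) {
                        candidates.add(r);
                    }
                }
            }
            // Case 2: row keys that have the code as their prefix (e.g., "dn5b" .. "dn5c").
            Scan scan = new Scan().withStartRow(Bytes.toBytes(code));
            String stop = CandidateRows.prefixStopKey(code);
            if (stop != null) {
                scan.withStopRow(Bytes.toBytes(stop));
            }
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    candidates.add(r);
                }
            }
        }
        return candidates;
    }
}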
4. EXPERIMENTAL EVALUATION

To evaluate our spatial index on top of HBase, we use HBase (version 0.96) and Hadoop (version 1.0.4) running on Java 1.6.0, installed on a cluster of 11 physical machines (one master machine) on Emulab [10]: each has 12GB RAM, one 2.4 GHz 64-bit quad-core Xeon E5530 processor and two 7200 rpm SATA disks (500GB and 250GB). We run HBase RegionServers on the same machines as the DataNodes, along with a ZooKeeper ensemble of 3 machines. For each setting and each query, the reported spatial query processing time is the fastest time among five cold runs, to remove any possible bias posed by OS and/or network activity. We use GeoLife GPS Trajectories (GeoLife in short) [13] and San Francisco taxi cab traces (SFTaxi in short) [9] for our experiments. GeoLife and SFTaxi contain 24,876,977 and 11,219,955 GPS point records respectively.

We first present spatial query processing performance using our index on top of HBase running on HDFS. As our baseline approach, we store the spatial objects using their latitude (or longitude) as the HBase row key (i.e., a one-dimensional index). We choose this approach as our baseline because it can also be implemented without modifying HBase and, similar to our spatial index, HBase range scans can be utilized, allowing a fair comparison. For example, given a containedIn query, we use the leftmost and rightmost latitudes (or longitudes) of the query geometry as the start and end row keys of an HBase range scan respectively.

We implement a Hadoop MapReduce job to efficiently store the spatial objects in HBase. Also, we represent each geohash code as a binary array, instead of a string, to handle geohash codes efficiently. By default, we empirically choose 40 bits as the maximum length of geohash codes because that value strikes a balance between the number of rows and the number of columns of each row. 2,608,848, 4,744,257 and 4,886,185 HBase rows are generated to store the spatial objects using our index, the latitude-based baseline approach and the longitude-based baseline approach respectively.

In this paper, we report the results of withinDistance and containedIn queries. We generate 300 withinDistance queries by randomly selecting a point in the datasets and using a distance of 10m, 100m or 1km. This generation process guarantees that we get at least one point record as the output of each query execution. We also generate 100 containedIn queries by randomly selecting two points in the datasets and using them as the lower-left and upper-right points of a rectangle.

[Figure 1: Query Processing Time. (a) withinDistance queries; (b) containedIn queries. Query processing time ratio of the latitude-based and longitude-based baselines relative to our index (set to 1), grouped by selectivity (# query result records).]

For brevity, we first categorize the queries based on their selectivity and then compare our query processing performance with that of the baseline approach using the ratio of their query processing times, where we set our query processing time to 1, as shown in Fig. 1. Query processing with our spatial index is more than one order of magnitude faster than both the latitude-based and longitude-based baseline approaches, on average, for those withinDistance queries which select fewer than 10,000 records, as shown in Fig. 1(a). As we decrease the selectivity of queries, the performance gain of our spatial index also drops because retrieving a large number of rows for query evaluation is inevitable. However, query processing with our spatial index is still 30% faster than the latitude-based baseline approach, on average, for those withinDistance queries which select more than 1 million records. For containedIn queries, even though our query processing is still more than one order of magnitude faster than the latitude-based baseline approach for queries with high selectivity, as shown in Fig. 1(b), its performance gain is generally smaller than that for withinDistance queries. This is primarily because containedIn queries usually cover a wider region than withinDistance queries and thus the pruning power of the baseline approaches is higher for containedIn queries. Specifically, the average precisions (i.e., the ratio of true positives to all evaluated candidate spatial objects) of the latitude-based baseline approach are 8% and 12% for withinDistance queries and containedIn queries respectively.
[Figure 2: Effects of different distances. (a) Processing time (sec, log scale); (b) number of accessed rows (log scale); x-axis: range (m).]

Fig. 2 shows the query processing results using different distances for the same query point of a withinDistance query. The query processing time understandably increases as we enlarge the query region, because more HBase rows are accessed and thus more candidate records are evaluated during query processing.

Finally, we compare the pruning power of our spatial index with that of an R-tree-based index. We use an open-source R-tree implementation [3] for this evaluation. We want to emphasize that the focus of this paper is on a scalable and lightweight spatial index which can be easily applied to existing systems without modifying their internal implementation. Outperforming the pruning power of R-tree-based indices is not the purpose of this paper, because R-tree-based indices maintain expensive data structures and mostly require internal and complicated modification of the storage systems. Nevertheless, the precision results in Fig. 3 show that our index has one order of magnitude higher precision than the R-tree-based index for those queries having very high selectivity (selecting fewer than 10 records). Our spatial index demonstrates relatively consistent precision across different selectivity levels, while the R-tree-based index has higher precision for less selective queries.

[Figure 3: Precision comparison (withinDistance). Precision of the R-tree-based index and our index, grouped by selectivity (# query result records).]

5. CONCLUSION

In this paper we have proposed efficient and scalable spatial indexing techniques for big data stored in distributed storage systems. Based on a hierarchical spatial data structure, called geohash, we have presented how we develop a lightweight spatial index for big data stored in a distributed file system, especially on top of HBase.

6. ACKNOWLEDGMENTS

This work was performed while Kisung Lee was an intern at IBM Research T.J. Watson.

7. REFERENCES

[1] Geo Developer Guidelines. https://dev.twitter.com/terms/geo-developer-guidelines.
[2] Geohash. http://geohash.org/.
[3] JSI RTree Library. http://jsi.sourceforge.net/.
[4] A. Akdogan, U. Demiryurek, F. Banaei-Kashani, and C. Shahabi. Voronoi-Based Geospatial Query Processing with MapReduce. In CLOUDCOM '10, 2010.
[5] H. Liao, J. Han, and J. Fang. Multi-dimensional Index on Hadoop Distributed File System. In NAS '10, 2010.
[6] X. Liu, J. Han, Y. Zhong, C. Han, and X. He. Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS. In CLUSTER '09, 2009.
[7] W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efficient Processing of k Nearest Neighbor Joins Using MapReduce. Proc. VLDB Endow., 5(10), June 2012.
[8] D. Papadias, T. Sellis, Y. Theodoridis, and M. J. Egenhofer. Topological Relations in the World of Minimum Bounding Rectangles: A Study with R-trees. SIGMOD Rec., 24(2), May 1995.
[9] M. Piorkowski, N. Sarafijanovic-Djukic, and M. Grossglauser. A Parsimonious Model of Mobile Partitioned Networks with Clustering. In COMSNETS '09, 2009.
[10] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An Integrated Experimental Environment for Distributed Systems and Networks. SIGOPS Oper. Syst. Rev., 36, 2002.
[11] C. Zhang, F. Li, and J. Jestes. Efficient Parallel kNN Joins for Large Data in MapReduce. In EDBT '12, 2012.
[12] S. Zhang, J. Han, Z. Liu, K. Wang, and S. Feng. Spatial Queries Evaluation with MapReduce. In GCC '09, 2009.
[13] Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma. Mining Interesting Locations and Travel Sequences from GPS Trajectories. In WWW '09, 2009.
