Efficient Spatial Query Processing for Big Data
https://doi.org/10.1145/2666310.2666481…
4 pages
Abstract
Spatial queries are widely used in many data mining and analytics applications. However, the huge and growing volume of spatial data makes it challenging to process these queries efficiently. In this paper, we present a lightweight and scalable spatial index for big data stored in distributed storage systems. Experimental results show the efficiency and effectiveness of our spatial indexing technique for different spatial queries.
Related papers
Journal of Statistics and Management Systems, 2018
The web is increasingly used from mobile devices, and it is increasingly possible to track a user's location, which provides immense opportunities for geospatial data and its management. Because services on each mobile device use location information, the resulting volume of spatial data makes it difficult to process spatial queries efficiently; we therefore need a lightweight and scalable approach to processing large amounts of data stored in distributed file systems. Most SNSs (social network services) focus on connecting user accounts with location information, for example through check-in services, which helps them collect information about user activities and location ratings but also increases the data load on their servers. In this article, we propose an indexing technique combined with efficient processing of Boolean top-k spatial queries: location data is compressed to save space, and the Boolean query filters results so that unrelated data is not processed, which saves space and speeds up query processing.
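A Boolean top-k spatial query of the kind described above can be illustrated with a minimal sketch. It assumes a Boolean predicate that requires every query keyword to be present, and it ranks by Euclidean distance; the function name `boolean_topk`, the tuple layout, and the sample data are hypothetical, and the sketch omits the compression and indexing steps that the article proposes.

```python
import heapq
import math

def boolean_topk(objects, query_point, required_keywords, k):
    """Return names of the k nearest objects containing every required keyword.

    objects: iterable of (name, (x, y), keywords) tuples (hypothetical layout).
    """
    required = set(required_keywords)
    qx, qy = query_point
    # Apply the Boolean filter first, so unrelated objects are never ranked.
    candidates = (
        (math.hypot(x - qx, y - qy), name)
        for name, (x, y), keywords in objects
        if required <= set(keywords)
    )
    # Keep only the k candidates with the smallest distance.
    return [name for _, name in heapq.nsmallest(k, candidates)]

# Hypothetical check-in style data: name, location, keyword set.
places = [
    ("cafe_a", (1.0, 1.0), {"coffee", "wifi"}),
    ("cafe_b", (5.0, 5.0), {"coffee", "wifi"}),
    ("bar_c", (0.5, 0.5), {"beer"}),
    ("cafe_d", (2.0, 2.0), {"coffee"}),
]
nearest = boolean_topk(places, (0.0, 0.0), {"coffee", "wifi"}, k=1)
```

Here `bar_c` is the closest object but fails the keyword predicate, so it is discarded before any ranking takes place.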
2018
Spatial information processing has been a focus of research over the past decade. In spatial databases, data associated with spatial coordinates and extents are retrieved based on spatial proximity. A large number of spatial indexes have been proposed to support efficient indexing and retrieval of spatial objects in large databases. The goal of this paper is to review advanced access-method techniques. The paper classifies existing multidimensional access methods according to their type of indexing and their performance on spatial queries. K-d trees outperform quadtrees without requiring additional memory.
2001
Emerging database applications require the use of new indexing structures beyond B-trees and R-trees. Examples are the k-D tree, the trie, the quadtree, and their variants. They are often proposed as supporting structures in data mining, GIS, and CAD/CAM applications. A common feature of all these indexes is that they recursively divide the space into partitions. A new extensible index structure, termed SP-GiST, is presented that supports this class of data structures, mainly the class of space-partitioning unbalanced trees. Simple method implementations are provided that demonstrate how SP-GiST can behave as a k-D tree, a trie, a quadtree, or any of their variants. Issues related to clustering tree nodes into pages as well as concurrency control for SP-GiST are addressed. A dynamic minimum-height clustering technique is applied to minimize disk accesses and to make using such trees in database systems possible and efficient. A prototype implementation of SP-GiST is presented, as well as performance studies of SP-GiST's various tuning parameters.
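The recursive space division shared by these indexes can be sketched with a small point quadtree. This is a standalone illustration, not SP-GiST or its interface; the class name `QuadTree`, the `capacity` and `max_depth` parameters, and the half-open bounds convention are assumptions made for the example.

```python
class QuadTree:
    """Point quadtree over the half-open box [x0, x1) x [y0, y1)."""

    def __init__(self, x0, y0, x1, y1, capacity=2, depth=0, max_depth=16):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth
        self.points = []      # points stored at this node while it is a leaf
        self.children = None  # four sub-quadrants once the node has split

    def insert(self, x, y):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False  # point lies outside this node's region
        if self.children is None:
            # max_depth stops endless splitting on duplicate points.
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append((x, y))
                return True
            self._split()
        return any(child.insert(x, y) for child in self.children)

    def _split(self):
        # Divide the region into four equal quadrants and push points down.
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [
            QuadTree(x0, y0, mx, my, *args),
            QuadTree(mx, y0, x1, my, *args),
            QuadTree(x0, my, mx, y1, *args),
            QuadTree(mx, my, x1, y1, *args),
        ]
        for px, py in self.points:
            any(child.insert(px, py) for child in self.children)
        self.points = []

    def range_query(self, qx0, qy0, qx1, qy1):
        """Return points inside the half-open query box."""
        x0, y0, x1, y1 = self.bounds
        if qx1 <= x0 or qx0 >= x1 or qy1 <= y0 or qy0 >= y1:
            return []  # query box does not overlap this region: prune
        hits = [(x, y) for x, y in self.points
                if qx0 <= x < qx1 and qy0 <= y < qy1]
        for child in self.children or []:
            hits.extend(child.range_query(qx0, qy0, qx1, qy1))
        return hits

tree = QuadTree(0, 0, 8, 8)
for p in [(1, 1), (2, 2), (6, 6), (7, 1), (1, 7)]:
    tree.insert(*p)
found = sorted(tree.range_query(0, 0, 4, 4))
```

The tree is unbalanced by construction: dense regions split repeatedly while sparse regions stay as single leaves, which is the property that makes page clustering a concern for this class of structures.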
International Journal of Information and Decision Sciences, 2020
Nowadays, real-time spatial applications have become more and more important. Such applications result in dynamic environments where data as well as queries are continuously moving. As a result, a tremendous amount of real-time spatial data is generated every day. The growth in data volume seems to outpace advances in databases and data warehouses, especially since users expect to receive the results of each query within a short time regardless of the load on the system. To solve this problem, several optimisation techniques are used. As a first contribution, we propose a novel data partitioning approach for real-time spatial big data, named vertical partitioning approach for real-time spatial big data (VPA-RTSBD). This contribution is an implementation of the matching algorithm for traditional vertical partitioning. As a second contribution, we propose a new frequent itemset mining approach that relaxes the notion of window size, together with a new algorithm named PrePost*-RTSBD. Finally, a simulation study shows that our contributions can achieve a significant performance improvement.
arXiv (Cornell University), 2015
Recently, MapReduce-based spatial query systems have emerged as a cost-effective and scalable solution to large-scale spatial data processing and analytics. MapReduce-based systems achieve massive scalability by partitioning the data and running query tasks on those partitions in parallel. Effective data partitioning is therefore critical for task parallelization and load balancing, and directly affects system performance. However, several pitfalls of spatial data partitioning make this task particularly challenging. First, data skew is very common in spatial applications. To achieve the best query performance, data skew needs to be reduced to a minimum. Second, spatial partitioning approaches generate boundary objects that cross multiple partitions and add extra query processing overhead. Boundary objects therefore need to be minimized. Third, the high computational complexity of spatial partitioning algorithms, combined with massive amounts of data, requires an efficient approach to partitioning in order to achieve fast overall query response. In this paper, we provide a systematic evaluation of multiple spatial partitioning methods with a set of different partitioning strategies, and study their implications for the performance of MapReduce-based spatial queries. We also study sampling-based partitioning methods and their impact on queries, and propose several MapReduce-based high-performance spatial partitioning methods. The main objective of our work is to provide comprehensive guidance for optimal spatial data partitioning to support scalable and fast spatial data processing in distributed computing environments such as MapReduce. The algorithms developed in this work are open source and can be easily integrated into different high-performance spatial data processing systems.
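The two pitfalls of skew and boundary objects can be made concrete with a minimal sketch of fixed-grid partitioning, one of the simplest strategies and not necessarily one the paper evaluates. It assumes each object is an axis-aligned bounding rectangle lying inside the extent; the function names `grid_partition` and `skew`, and the choice of maximum load divided by mean load as the skew measure, are illustrative assumptions.

```python
from collections import defaultdict

def grid_partition(rects, extent, cols, rows):
    """Assign rectangles to the cells of a uniform cols x rows grid.

    rects: dict of id -> (x0, y0, x1, y1); extent: (x0, y0, x1, y1).
    Returns (cell -> list of ids, list of boundary object ids).
    """
    ex0, ey0, ex1, ey1 = extent
    cell_w = (ex1 - ex0) / cols
    cell_h = (ey1 - ey0) / rows

    def col(x):
        # Clamp so that coordinates on the far edge fall in the last cell.
        return max(0, min(int((x - ex0) // cell_w), cols - 1))

    def row(y):
        return max(0, min(int((y - ey0) // cell_h), rows - 1))

    partitions = defaultdict(list)
    boundary = []
    for rid, (x0, y0, x1, y1) in rects.items():
        cells = [(c, r)
                 for c in range(col(x0), col(x1) + 1)
                 for r in range(row(y0), row(y1) + 1)]
        # An object overlapping several cells is replicated to each of them.
        for cell in cells:
            partitions[cell].append(rid)
        if len(cells) > 1:
            boundary.append(rid)
    return partitions, boundary

def skew(partitions, cols, rows):
    """Maximum cell load divided by mean cell load (1.0 means balanced)."""
    loads = [len(partitions.get((c, r), []))
             for c in range(cols) for r in range(rows)]
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0

rects = {
    "a": (0.5, 0.5, 1.0, 1.0),
    "b": (1.5, 0.5, 2.5, 1.0),  # straddles the vertical grid line at x = 2
    "c": (3.0, 3.0, 3.5, 3.5),
}
parts, boundary = grid_partition(rects, (0.0, 0.0, 4.0, 4.0), cols=2, rows=2)
load_skew = skew(parts, 2, 2)
```

Replicating `b` into both cells it overlaps is what creates the extra processing overhead: a query touching either cell sees `b`, and results must later be deduplicated.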
Proceedings of the 2017 ACM International Conference on Management of Data
The widespread use of GPS-enabled cellular devices, i.e., smartphones, has led to the popularity of numerous mobile applications, e.g., social networks, micro-blogs, mobile web search, and crowd-powered reviews. These applications generate large amounts of geo-tagged textual data, i.e., spatial-keyword data. This data needs to be processed and queried at an unprecedented scale. The management of spatial-keyword data at this scale goes beyond the capabilities of centralized systems. We live in the era of big data, and the big data model is currently being used to address scalability issues in various application domains. This has led to the development of various big spatial-keyword processing systems. These systems are designed to ingest, store, index, and query huge amounts of spatial-keyword data. In this 1.5-hour tutorial, we explore recent research efforts in the area of big spatial-keyword processing. First, we give the main motivations behind big spatial-keyword systems, with real-life applications. We describe the main models for big spatial-keyword processing and list the popular spatial-keyword queries. Then, we present the approaches that have been adopted in big spatial-keyword processing systems, with special attention to data indexing and to spatial and keyword data partitioning. Finally, we conclude this tutorial with a discussion of some of the open problems and research directions in the area of big spatial-keyword query processing.
2017
Nowadays, a vast amount of data is generated and collected every moment, and this data often has a spatial and/or temporal aspect. To analyze these massive data sets, big data platforms such as Apache Hadoop MapReduce and Apache Spark emerged, and extensions that take the spatial characteristics into account were created for them. In this paper, we analyze and compare existing solutions for spatial data processing on Hadoop and Spark. In our comparison, we investigate their features as well as their performance in a micro-benchmark for spatial filter and join queries. Based on the results and our experiences with these frameworks, we outline the requirements for a general spatio-temporal benchmark for Big Spatial Data processing platforms and sketch first solutions to the identified problems.
Geoinformatica
Spatial data warehouses (SDWs) allow for spatial analysis together with analytical multidimensional queries over huge volumes of data. The challenge is to retrieve data related to ad hoc spatial query windows according to spatial predicates while avoiding the high cost of joining large tables. Mechanisms that provide efficient query processing over SDWs are therefore essential. In this paper, we propose two efficient indices for SDWs: the SB-index and the HSB-index. The proposed indices share the following characteristics. They enable multidimensional queries with spatial predicates for SDWs and also support predefined spatial hierarchies. Furthermore, they compute the spatial predicate and transform it into a conventional one, which can be evaluated together with other conventional predicates by accessing a star-join Bitmap index. While the SB-index has a sequential data structure, the HSB-index uses a hierarchical data structure to enable clustering of spatial objects and a specialized buffer pool to decrease the number of disk accesses. The advantages of the SB-index and the HSB-index over the DBMS resources for SDW indexing (i.e., star-join computation and materialized views) were investigated through performance tests, which issued roll-up operations extended with containment and intersection range queries. The performance results showed improvements ranging from 68% to 99% over both the star-join computation and the materialized view. Furthermore, the proposed indices proved to be very compact, adding less than 1% to the storage requirements. Both the SB-index and the HSB-index are therefore excellent choices for SDW indexing. Choosing between them depends mainly on the query selectivity of the spatial predicates: low query selectivity benefits the HSB-index, whereas the SB-index provides better performance for higher query selectivity.