RDF Query Answering Using Apache Spark:

Review and Assessment

Giannis Agathangelos1, Georgia Troullinou1, Haridimos Kondylakis1, Kostas Stefanidis2, Dimitris Plexousakis1
1 ICS-FORTH, Greece, {jagathan, troulin, kondylak, dp}@[Link]
2 University of Tampere, Finland, [Link]@[Link]

Abstract—The explosion of the web and the abundance of linked data demand effective and efficient methods for storage, management and querying. More specifically, the ever-increasing size and number of RDF data collections raises the need for efficient query answering, and dictates the usage of distributed data management systems for effectively partitioning and querying them. To this direction, Apache Spark is one of the most active big-data approaches, with more and more systems adopting it for efficient, distributed data management. The purpose of this paper is to provide an overview of the existing works dealing with efficient query answering, in the area of RDF data, using Apache Spark. We discuss the characteristics and the key dimensions of such systems, we describe novel ideas in the area and the corresponding drawbacks, and provide directions for future work.

I. INTRODUCTION

The prevalence of Open Linked Data, and the explosion of available information on the Web, have led to an enormous amount of widely available RDF datasets [6]. To store, manage and query these ever-increasing RDF data, many systems have been developed by the research community and by commercial vendors. To this direction, distributed big data processing engines, like Hadoop, HBase and Impala [14], are exploited more and more for this purpose due to their ability to effectively handle massive amounts of data. Apache Spark is one of the most active big-data approaches, with an ever-increasing interest in using it for efficient query answering over RDF data. The platform uses in-memory data structures that can be used to store RDF data, offering increased efficiency and enabling effective distributed query answering.

As such, the goal of this work is to provide an overview of the works dealing with efficient query answering, using Apache Spark, for RDF data. Focusing on this specific field, we fill in the gap in the literature, providing a complete and detailed overview of the current research activities in the area. More specifically, our contributions are the following. Firstly, we present and discuss various dimensions of analysis, identifying key elements for such systems. Then, we classify the approaches according to the data model and the Apache Spark abstraction they use. We proceed further to perform an in-depth overview of the approaches in each category, providing a unique perspective on the research in the area, and highlighting the novel ideas and the drawbacks of each one. Finally, we identify what is missing from the area and provide interesting directions for future work.

There are already surveys in the area of generic RDF storage [11] and on RDF data management systems in cloud environments [15]. However, distributed RDF query answering systems are beyond the scope of the former, as the authors claim in the first paper, whereas they both cover mainly works before the prevalence of Spark. As such, our work can be seen as complementary to the aforementioned surveys, shedding light on the area of RDF query answering, specifically on works using Apache Spark as the underlying data management infrastructure. From a different perspective, [8] presents a preliminary experimental comparison, evaluating Spark implementations for RDF systems, focusing on techniques for distributing data. Specifically, the authors analyze five representative RDF data distribution approaches. The purpose there is to provide a clear view about the most efficient distribution solution, for a given context, among the evaluated solutions, and to show the challenges each approach faces when it comes to Spark implementation. In contrast, in our paper we do not attempt a comparative experimental evaluation of a limited number of approaches, but rather identify and present an overview of the main research directions in the area.

The rest of this paper is structured as follows: In Section II, we present some background required when speaking about RDF data. Then, in Section III, we define the dimensions we use for describing the systems presented in Section IV. Finally, Section V concludes this paper, identifies gaps in the area and presents directions for future work.

II. BACKGROUND

A. The Resource Description Framework (RDF)

The representation of knowledge in RDF is based on triples of the form (subject predicate object), which record that subject is related to object via predicate. Formally, the representation of RDF data is based on three disjoint and infinite sets of resources, namely: URIs (U), literals (L) and blank nodes (B). RDF allows representing a form of incomplete information through blank nodes, standing for unknown constants or URIs. As such, a triple is a tuple (subject predicate object) from (U ∪ B) × U × (U ∪ L ∪ B). In addition, to state that a resource r is of a given type, the property rdf:type is used.

RDF datasets have attached semantics through RDFS [1], a vocabulary description language that includes a set of inference rules used to generate new, implicit triples from explicit ones.

Finally, a collection of triples can be represented as a labeled directed graph, in which nodes represent subjects or objects and labeled directed edges represent predicates.
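The triple constraint above can be illustrated with a small, self-contained Python sketch (purely illustrative, not part of any system surveyed here): terms are tagged as URIs, literals or blank nodes, and a triple is accepted only if it falls in (U ∪ B) × U × (U ∪ L ∪ B).

```python
# Illustrative sketch of the RDF triple constraint (U ∪ B) × U × (U ∪ L ∪ B).
# Term kinds: "uri" (U), "lit" (L), "bnode" (B). All names here are invented.

def is_valid_triple(subject, predicate, obj):
    """A triple is valid iff subject ∈ U ∪ B, predicate ∈ U, object ∈ U ∪ L ∪ B."""
    s_kind, p_kind, o_kind = subject[0], predicate[0], obj[0]
    return (s_kind in ("uri", "bnode")
            and p_kind == "uri"
            and o_kind in ("uri", "lit", "bnode"))

alice = ("uri", "http://example.org/alice")
name = ("uri", "http://xmlns.com/foaf/0.1/name")
lit = ("lit", "Alice")
bnode = ("bnode", "_:b0")

print(is_valid_triple(alice, name, lit))    # True: U × U × L
print(is_valid_triple(alice, lit, alice))   # False: a literal cannot be a predicate
print(is_valid_triple(bnode, name, bnode))  # True: blank nodes may be subject or object
```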
B. Querying

For querying RDF data, SPARQL is used. SPARQL [2] is currently the standard query language for the semantic web and has become an official W3C recommendation. Essentially, SPARQL is a graph-matching language. SPARQL queries contain a set of triple patterns, also called basic graph patterns. Triple patterns are like RDF triples, except that each of the subject, predicate and object may be a variable instead of a constant. Solutions to the variables are then found by matching the patterns in the query to triples in the dataset. Thus, SPARQL queries are pattern-matching queries on the triples that compose an RDF data graph.

Specifically, a SPARQL query consists of three parts. The pattern-matching part includes several features of graph pattern matching, like optional parts, union of patterns, nesting, and filtering (or restricting) values of possible matchings. The solution modifiers, once the output of the pattern has been computed (in the form of a table of values of variables), allow modifying these values by applying classical operators, like projection, distinct, order, limit, and offset. Finally, the output of a SPARQL query can be of different types: yes/no answers, selections of values of the variables which match the patterns, construction of new triples from these values, and descriptions of resources.

According to the position of the variables in the triple patterns, a query can have different shapes that affect its performance. Star-shaped patterns/queries are characterized by subject-subject joins between triple patterns, as the join variable is on the subject position. Linear-shaped patterns/queries are made of subject-object (or object-subject) joins; for example, the join variable is on the object position in one triple pattern and on the subject position in the other. Snowflake-shaped patterns/queries are combinations of several star-shaped connections. Finally, more complex queries combine the above described patterns.

III. EVALUATION DIMENSIONS

Apache Spark [22] is an in-memory distributed computing platform designed for large-scale data processing. Spark was originally developed at UC Berkeley in 2009 and is currently one of the most active big-data Apache projects. It can be considered a main-memory extension of the MapReduce model [10], since both of them enable parallel computations on commodity machines with locality-aware scheduling, fault tolerance and load balancing. Because of Spark's main-memory implementation, it can be up to 100 times faster than Hadoop. This level of efficiency is due to the two main data abstractions that Spark provides: RDDs (Resilient Distributed Datasets) and Dataframes. The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of data elements, partitioned across the nodes of a cluster, that can be operated on in parallel with a low-level API offering transformations and actions. Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational database. By using Dataframes, Spark leverages this schema knowledge and ends up with a much more efficient data encoding than Java serialization.

On top of RDDs and DataFrames, Spark proposes two higher-level data access models, GraphX and Spark SQL, for processing semi-structured data in general. Those data models can be used to handle RDF data and SPARQL queries. Spark GraphX [21] is a library enabling graph processing by extending the RDD abstraction, and hence introduces a new feature called the Resilient Distributed Graph, or RDG. GraphX combines the benefits of graph-parallel and data-parallel systems, as it efficiently expresses graph computations within the framework of the data-parallel system. Spark SQL [3] is Spark's interface for working with structured and semi-structured data. It enables querying data stored in Dataframes using SQL. It also provides an optimizer, Catalyst, which is claimed to improve the execution of queries.

As such, when studying the RDF processing approaches on Apache Spark, the key factors are: a) the data model that is selected in order to process the RDF data, and b) the Spark data abstraction each work decided to rely its implementation on.

• Data Model: The model selected for the representation of the RDF data. It can be one of the following:
  a. The Triple Model. RDF data are stored and processed in their natural form, as triples that contain subject, predicate, object.
  b. The Graph Model. The RDF model is represented as a directed labeled graph in which, for example, the triple (s hasProperty p) can be interpreted as an edge labeled with hasProperty from node s to node p. This model is used mainly by systems that are built on top of the graph processing API of Spark.

• Apache Spark Abstraction: Spark provides various libraries and data abstractions, each of them having several advantages and disadvantages.
  a. RDD. RDDs provide a low-level API that gives great control over the dataset. It lacks schema control, but gives greater flexibility when it comes to storage and partitioning, as it offers the choice of implementing a custom partitioner.
  b. Dataframes. A DataFrame is an immutable distributed collection of data that is organized into named columns. It is designed to make processing large datasets even easier, allowing developers to impose a structure onto a distributed collection of data.
  c. Spark SQL. It enables querying structured data stored in Dataframes using SQL, and provides an optimizer for improving execution times.
  d. GraphX. This is Spark's library for graph processing. By combining both graph-parallel and data-parallel processing, it can achieve great performance and flexibility. It also comes with well-known graph processing algorithms, like PageRank, triangle counting and shortest-paths computation.
  e. Graphframes. This is the newest graph processing API, which benefits from the scalability and high performance of Dataframes. In contrast with GraphX, it also supports queries over graphs. It is not yet an official part of Apache Spark, but comes as a side package.

Figure 1 summarizes the different dimensions based on which we study RDF query processing methods.
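To make the query-shape distinction above concrete, the following plain-Python sketch (illustrative only, not drawn from any of the surveyed systems) classifies the join between two triple patterns from the positions of their shared variable.

```python
# Illustrative classification of the join between two SPARQL triple patterns.
# Variables are strings starting with '?', as in SPARQL syntax.

def is_var(term):
    return term.startswith("?")

def join_kind(tp1, tp2):
    """Classify the join of two (subject, predicate, object) triple patterns.

    'star'   : shared variable on both subject positions (subject-subject join)
    'linear' : shared variable on the object of one pattern and the
               subject of the other (subject-object / object-subject join)
    """
    s1, _, o1 = tp1
    s2, _, o2 = tp2
    if is_var(s1) and s1 == s2:
        return "star"
    if (is_var(o1) and o1 == s2) or (is_var(o2) and o2 == s1):
        return "linear"
    return "other"

# ?x likes ?y . ?x follows ?z  -> both patterns share the subject ?x
print(join_kind(("?x", "likes", "?y"), ("?x", "follows", "?z")))  # star
# ?x likes ?y . ?y follows ?z  -> object of the first is subject of the second
print(join_kind(("?x", "likes", "?y"), ("?y", "follows", "?z")))  # linear
```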
Fig. 1. A taxonomy presenting the dimensions for organizing RDF query processing methods.

Besides the aforementioned dimensions for categorizing the works in the area, there are also a number of interesting dimensions according to which we could further study them. Specifically:

• Query Processing: This dimension identifies the procedure for translating a SPARQL query into a query compatible with the Spark format, and how the query gets evaluated over the dataset. For example, a SPARQL query can be translated into SQL code, which is then executed using Spark SQL.

• Query Processing Optimizations: This dimension describes the optimization methods employed by a selected system. For example, a very common way for query optimization is to re-order the join sequence based on data statistics.

• Data Partitioning: Choosing the right data partitioning strategy is essential in distributed systems. The goal is to maximize data locality and minimize network communication to achieve the desirable performance. Apache Spark uses a hash partitioning strategy by default, but this can be modified depending on the data abstraction that is used.

• SPARQL Fragment: SPARQL contains a huge set of operations and most of the systems do not provide full support for it. All systems start from evaluating simple blocks of triple patterns, called Basic Graph Patterns (BGP), and continue building on top of this for more operations (BGP+), such as average (AVG) and filter operations (FILTER).

• System Contribution: The main focus of most of the systems is to improve query performance. Some systems focus on particular query types, e.g., star queries, and others target handling multiple or all query types.

IV. RDF PROCESSING APPROACHES

In this section, we organize systems based on the way they model and process RDF data. Specifically, we distinguish between i) triple processing systems and ii) graph processing systems. In triple processing systems, data are loaded as triples and their raw form is used for further processing. Usually in such systems, a simple partitioning technique, like hash or vertical partitioning, is preferred, whereas for the evaluation of the issued SPARQL queries, the RDD API or Spark SQL is used. In graph processing systems, the RDF data are modeled as graphs, and queries are evaluated directly over them. Either GraphX or GraphFrames is used for query processing.

A. Triple Processing Systems

1) RDD Implementation: HAQWA [7] was the first approach that tries to process RDF data on Apache Spark. It proposes a trade-off between data distribution complexity and query answering efficiency. The system's fragmentation and allocation is a two-step procedure. In the first step, a hash-based partitioning is performed on triple subjects. This fragmentation ensures that star-shaped queries are performed locally, but no guarantees are provided for other query types. In the second step, data are allocated according to the analysis of frequent queries executed over the dataset. Then, at query time, the system decomposes a query pattern into a set of sub-queries that can be evaluated locally. Each of those sub-queries is a candidate to be the starting point (seed query) to evaluate the entire query pattern. To prevent network communication, the missing triples are replicated into the partitions that contain the triples of the seed. To do so, for each candidate and partition, HAQWA computes the cost of transferring the missing triples into the current partition. HAQWA performs an encoding of string values to integer ones on the data, which minimizes data volume and makes processing more efficient. Query processing is based on a mapping from SPARQL to the RDD API, like join, filter and count.

In SPARQLGX [13], RDF datasets are vertically partitioned. As such, a triple (s p o) is stored in a file named p whose content keeps only the s and o entries. By following this approach, the memory footprint is reduced and the response time is minimized when queries have bounded predicates. The query translation is done by parsing the triple patterns one by one and mapping them to Spark's RDD API. In order to deal with a group of triple patterns, the result of each sub-query is joined with the next one having a common variable with it, using this common variable as a key (keyBy in Spark). If no common variable is found between two triple patterns, then the cross product is computed. SPARQLGX is able to evaluate Basic Graph Pattern (BGP) queries and also operations like DISTINCT, SORT, UNION, OPTIONAL and FILTER. As an optimization, statistics on the data are computed in order to reorder the join execution of each query. More specifically, the system counts all distinct subjects, predicates and objects of the given dataset.

2) Spark SQL: S2RDF [19] is an efficient and scalable system on top of Spark that aims to provide improvements for all query types. This work presents an extended version of the classic vertical partitioning technique, called ExtVP. Each ExtVP table is a set of sub-tables corresponding to a vertical partition (VP) table. The sub-tables are generated by using right outer joins between VP tables. More specifically, [19] pre-computes semi-join reductions for subject-subject (SS), object-subject (OS) and subject-object (SO) joins. For SPARQL query execution, the triples are joined via shared variables. For example, for the triple patterns ?x likes ?y and ?x follows ?z, the ?x variable is used for joining them. Assuming that there are two tables containing 100 entries each, with only 10 entries sharing the same subject, we need 10,000 operations to join them. If we store the data using ExtVP, the operations can be decreased to 10 and, as such, the efficiency of the query is enhanced.
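SPARQLGX-style vertical partitioning and common-variable joins can be mimicked in a few lines of plain Python (a sketch of the idea only, with invented data; the real system runs on Spark RDDs):

```python
from collections import defaultdict

# Vertical partitioning: one "file" (here, a list) per predicate,
# keeping only (subject, object) pairs, as SPARQLGX does.
triples = [
    ("alice", "likes", "bob"),
    ("alice", "follows", "carol"),
    ("bob",   "likes", "dave"),
]
vp = defaultdict(list)
for s, p, o in triples:
    vp[p].append((s, o))

# Evaluate the BGP { ?x likes ?y . ?x follows ?z } by joining the
# two predicate partitions on the common variable ?x (the subject).
def join_on_subject(left, right):
    # keyBy-style: index the right side by subject, then probe it.
    by_subj = defaultdict(list)
    for s, o in right:
        by_subj[s].append(o)
    return [(s, y, z) for s, y in left for z in by_subj.get(s, [])]

result = join_on_subject(vp["likes"], vp["follows"])
print(result)  # [('alice', 'bob', 'carol')]
```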
For query processing, S2RDF uses Jena ARQ to transform a SPARQL query into an algebra tree, and then traverses this tree to produce a Spark SQL query. To reduce the storage overhead of the extra sub-tables, the authors use a selectivity factor (SF). This SF defines the relative size of the ExtVP of a table compared to the corresponding VP table size. S2RDF supports the definition of a threshold for the SF, such that all ExtVP tables above this threshold are not considered. As a query optimization, an algorithm that reorders sub-query execution based on the table size and the number of bound variables is used. Sub-queries with the most bound variables are executed first, and among those with the same number of bound variables, the one corresponding to the smallest table size is picked. S2RDF supports SPARQL BGPs and also operations like FILTER, UNION, OFFSET, LIMIT and ORDER BY.

3) Hybrid Approaches: [17] studies two distributed join algorithms, partitioned join and broadcast join, for the evaluation of BGP expressions on top of Apache Spark. In this work, we see what kind of join algorithm each data abstraction of Apache Spark uses, and how we can combine them to achieve better performance. For the purposes of this study, all data are partitioned using hash-based partitioning on their subject. For every data abstraction of Spark, the authors implement a translation from SPARQL to the corresponding API, in order to execute queries on RDF by exploiting the framework.

Spark SQL uses the embedded Catalyst optimizer to generate the execution plan of the query, using the Spark Dataframe and the broadcast join algorithm. A significant drawback of this approach is that when a query has more than one triple pattern (almost always), a Cartesian product is used instead of a join, which is inefficient. The RDD approach translates each join into a partitioned join operator, following the order specified by the input logical query. This ends up with a sequence of (possibly n-ary) joins on different variables. This approach lacks efficiency when a broadcast join is cheaper, e.g., when joining a small with a large data set. It is worth mentioning that the RDD approach always reads the entire data set for each triple pattern. Data frames provide an important benefit which comes from the columnar compressed in-memory representation they use: up to 10 times larger data sets than with RDDs can be managed. This approach uses cost-based join optimization, preferring a single broadcast join to a sequence of partitioned joins if the dataset is smaller than a given threshold. For example, in the case of joining several small datasets with a large one, this approach is more efficient. In cases where join expressions are highly selective, filtering over a large dataset, this approach will not use the most efficient join, because it takes into account only the size. Also, this approach does not consider data partitioning.

Trying to overcome the limitations of the previous approaches, [17] offers a hybrid strategy that combines broadcast joins with partitioned joins. More specifically, it takes into account an existing data partitioning scheme to avoid useless data transfer, and uses the data compression of data frames to reduce the data access cost of self-join operations. The most efficient query plan for the combination of the two join algorithms is generated by a dynamic greedy optimization algorithm based on data statistics.

B. Graph Processing

1) GraphX: S2X [18] is a work that combines the graph-parallel abstraction of GraphX with the data-parallel computation of Spark to evaluate SPARQL queries in a distributed manner. GraphX is used to implement the graph pattern matching part of SPARQL, and the data-parallel computation of Spark to implement the other SPARQL operators.

RDF data are modeled as a property graph (for more on property graphs, see [20]). In a property graph, each vertex has an ID and properties, and edges have a property and the two IDs of the corresponding vertices. The edge property stores the predicate URI. Vertex properties are used to store subject and object URIs, and a data structure for candidate query variables that could match this vertex. The basic idea of the proposed algorithm is that every vertex in the graph stores the variables of a query for which it is a possible candidate. The first step is to match all triple patterns of a BGP independently, and then exchange messages between adjacent vertices to validate the match candidates until they do not change anymore. The set of matches for each vertex is called the local match, and the matched sets of adjacent vertices are called remote matches.

More specifically, all possibly relevant vertices are determined by matching each edge with all triple patterns from the BGP. Match candidates are validated according to some validation rules, using the local and remote match sets, and invalid ones get discarded. Locally changed match sets are sent to their neighbors in the graph for validation in the next step. The same process is repeated until no changes occur. The final output is composed of the individual subgraphs of the previous steps. S2X can also evaluate SPARQL operators like OPTIONAL, FILTER, ORDER BY, PROJECTION, LIMIT and OFFSET. These operators are implemented with the use of the Spark API.

[16] introduces as well an approach that is based on subgraph matching on GraphX. Here, each vertex is assigned 3 properties: 1) a label that keeps the value of its corresponding subject or object, 2) a Match Track table (MT) that contains variables and constants, and 3) a flag that shows whether a vertex is located at the end of a path (a sequence of matched BGP triples). Edges have a property called the edge label that keeps the predicate value.

The proposed algorithm iterates through all BGP triples of a SPARQL query. Graph matching is implemented with the use of the aggregateMessages operator of GraphX, which takes two functions, sendMsg and mergeMsg. SendMsg can be considered a map function that matches the current BGP triple with all graph triples. If a match is found, sendMsg prepares and sends different messages to the source and destination vertex of the triple. Then, using mergeMsg as a reduce function, the received messages are aggregated at their destination vertex. At the last step of each iteration, the joinVertices function is used to evaluate the old and the new property values in each vertex. After evaluating all BGP triples, the final MT tables of the end vertices, which contain partial results, are joined to generate the final query answer.

Spar(k)ql [12] also targets evaluating SPARQL queries over GraphX. Again, RDF data are modeled as a property graph. The node model adopted is quite simple: object properties are the edges of the graph, and data properties are stored in the nodes of the graph as node properties.
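The size-threshold rule behind the broadcast-versus-partitioned choice discussed above can be sketched in plain Python. The threshold and the table sizes are invented for illustration; the actual planner of [17] is a dynamic greedy algorithm over data statistics.

```python
# Sketch of a size-based choice between broadcast and partitioned joins,
# in the spirit of the hybrid strategy of [17]. The threshold and sizes
# are illustrative values, not taken from the paper.

BROADCAST_THRESHOLD = 10_000  # rows; below this, shipping a whole table is cheap

def choose_join(left_rows, right_rows, threshold=BROADCAST_THRESHOLD):
    """Broadcast the smaller side if it fits under the threshold,
    otherwise fall back to a partitioned (shuffle) join."""
    if min(left_rows, right_rows) <= threshold:
        return "broadcast"
    return "partitioned"

print(choose_join(500, 2_000_000))        # broadcast: the small side is shipped
print(choose_join(1_500_000, 2_000_000))  # partitioned: both sides are large
```

Note that, as the text points out, a purely size-based rule misses cases where a highly selective filter makes one side small only after evaluation, which is exactly the weakness the hybrid strategy addresses with statistics.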
exception to this is the rdf : type property. Although, it is an The Memory Data Model, RDSG (Resilient Discreted
object property, due to its popularity in SPARQL queries, it is Semantic SubGraph) is a distributed memory abstraction that
stored in the node properties along with the data properties. enables in-memory query computations on large clusters. This
model provides basic operations, like RDSG generation, filter,
In order to implement the query answering via vertex
prepartition and join. These operations are based on the Spark
programs, there is a need to store sub-results in tables in
API and are used to implement system’s query processing.
each node. The main idea is that each node get messages
from its neighboors and calculates the sub-results based on Regarding query processing, every query is decomposed
the incoming messages and the stored information. For this into an ordered sequence of variables and every query variable
reason, it performs a M ap phase with the query variables is made up of several triple patterns. For example, for the
as keys, and data tables as values, that contain possible sub- variable X, the authors compute the matches for each triple
results. Furthermore, Spar(k)ql provides a message model that pattern for this variable, and the matched triples are used to
allows all edges to be active until they get processed, so that find the matches on the next triple pattern that X exists, by
all type of queries will be able to be executed. A query plan joining them on the shared variable X. After this procedure
is generated by exploiting a breadth-first search algorithm that finishes, the process continues with the next variable.
uses object properties to create a tree. During the execution,
the query plan is traversed bottom-up and, for each node, it As query optimizations, variable’s class is passed through
iterates through the edges to find the corresponding matches. message to the corresponding triple patterns that contain the
variable. By following this method, the authors avoid reading
2) GraphFrames: [4] is the first work that implements many unnecessary data, and rdf : type triple patterns can
an efficient processing technique for RDF data over the be removed. On-demand, dynamic pre-partitioning is applied
Graphframes API [9]. It is a new graph processing platform to reduce the shuffling cost in the distributed join process.
created over Apache Spark, using the concept of Dataframes. Specifically, this process pre-partitions the MESG only when
In this approach, the input dataset splits into two separate it is on-demand loaded into the distributed memory. This
lists, a nodelist and an edgelist, which are used to generate pre-partitioning scheme guarantees that the records sharing
the unweighted labeled graph. SPARQL queries are translated the same variable value will be read into the same partition.
into query graphs which are then being optimized to improve Finally, an optimal query plan is generated that first determines
performance. To determine an optimal order of the query, the joining order of variables and then the order of triple
the algorithm takes into account the predicate frequency, and patterns in a job.
sorts sub-queries in non-descending order. In the next step,
another optimization takes place called local search space
pruning. In this procedure, for each query all triples in the V. D ISCUSSION & C ONCLUSIONS
dataset that do not match BGPs predicates get discarded. This In short, we can categorize the RDF query processing
technique results in a new graph created from this temporary approaches on Apache Spark based on the following dimen-
dataset, which has a much smaller search space. Finally, query sions: how the data are modeled in order to process them
processing takes the optimized query and the locally pruned (data model), and which is the Spark API that is used for
RDF data-graph, and performs subgraph matching to get the final query answer.

3) Hybrid Approaches: SparkRDF [5] is an elastic graph processing engine that is scalable, efficient, reduces I/O and intermediate communication, and is built on top of Spark without the use of a graph processing API. SparkRDF presents a novel storage scheme for managing big RDF graphs in HDFS and an iterative graph model for processing SPARQL queries distributively and in-memory. Several optimization techniques are proposed as well, including an optimal query plan and a dynamic partitioning method.

The Multi-layer Elastic Sub-graph (MESG) is the storage model created for this work. MESG consists of three levels of indexes. At the first level, there is a class index and a relation index: the relation index holds the triples that do not have an rdf:type predicate, while the class index holds those that do. Relation files are stored by predicate name and class files by object name. At the second level, MESG uses more information for indexing than the predicate alone: it divides the predicate files according to the types of their subjects and objects, yielding CR (class-relation) and RC (relation-class) indexes. At the third level, it goes one step further and creates an index that combines every part of the triple: the CRC (class-relation-class) index uses the subject's class, the predicate and the object's class, in order to exploit all the information that may be available for a triple.

The second dimension is the Spark API used for the implementation of the approach (Spark Abstraction). RDF data are stored and processed either in their natural form, as triples, or represented as a directed labeled graph. RDDs, Dataframes, Spark SQL, GraphX and Graphframes are the APIs provided by Spark. RDDs offer great flexibility regarding storage and partitioning, while Dataframes offer an immutable distributed collection of data organized into named columns. When data are stored in Dataframes, Spark SQL can be used for optimized query processing. GraphX supports graph-parallel and data-parallel processing, while Graphframes, in addition, support queries over graphs. Table I summarizes the various options in each dimension. Overall, the graph representation model is mainly used by the systems built on top of the graph processing APIs of Spark.

Generally speaking, we observe a trend towards using Apache Spark for efficient RDF query processing. Table II provides some additional characteristics of the surveyed approaches. The ultimate goal of all of them is to improve query performance by exploiting data parallelization. To this purpose, however, they neglect that data partitioning is a key element of efficient query processing with a huge impact on query answering; as such, they end up using simple partitioning techniques such as vertical or hash partitioning.
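As a rough illustration of SparkRDF's layered storage model, the three MESG index levels described above can be sketched over a plain in-memory triple list (a toy stand-in for the HDFS relation and class files the actual system materializes; all names here are illustrative):

```python
# Toy sketch of the three MESG index levels: level 1 splits rdf:type
# triples from the rest, level 2 adds the subject/object classes (CR/RC),
# level 3 combines subject class, predicate and object class (CRC).
from collections import defaultdict

RDF_TYPE = "rdf:type"

def build_mesg_indexes(triples, type_of):
    """triples: (subject, predicate, object) tuples; type_of maps a
    resource to its class (simplified here to a single class each)."""
    relation_index = defaultdict(list)  # level 1: keyed by predicate
    class_index = defaultdict(list)     # level 1: keyed by the class (object)
    cr_index = defaultdict(list)        # level 2: (subject class, predicate)
    rc_index = defaultdict(list)        # level 2: (predicate, object class)
    crc_index = defaultdict(list)       # level 3: (subj class, pred, obj class)
    for s, p, o in triples:
        if p == RDF_TYPE:
            class_index[o].append((s, p, o))
            continue
        relation_index[p].append((s, p, o))
        sc, oc = type_of.get(s), type_of.get(o)
        cr_index[(sc, p)].append((s, p, o))
        rc_index[(p, oc)].append((s, p, o))
        crc_index[(sc, p, oc)].append((s, p, o))
    return relation_index, class_index, cr_index, rc_index, crc_index
```

A triple pattern whose subject and object classes are both known can then be answered from the CRC index alone, touching only the triples that can possibly match.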
TABLE I. A TAXONOMY OF THE RDF QUERY PROCESSING APPROACHES WITH RESPECT TO DATA MODEL AND APACHE SPARK ABSTRACTION.

                                        Data Model
  Apache Spark Abstraction   The Triple Model    The Graph Model
  RDD                        [7], [13], [17]     [5]
  Dataframes                 [17]                -
  Spark SQL                  [19]                -
  GraphX                     -                   [18], [16], [12]
  Graphframes                -                   [4]

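To make the first dimension of Table I concrete, the following toy sketch (with made-up data) keeps the same RDF statements once in their natural triple form, the shape an RDD or Dataframe of triples has, and once as a directed labeled graph, the shape the GraphX- and Graphframes-based systems operate on:

```python
# Illustrative contrast of the two data models of Table I over toy data.
from collections import defaultdict

triples = [
    ("alice", "knows", "bob"),
    ("alice", "livesIn", "rome"),
    ("bob", "livesIn", "rome"),
]

# Triple model: answering a pattern means scanning/filtering the collection.
def who_lives_in(city, triples):
    return sorted(s for s, p, o in triples if p == "livesIn" and o == city)

# Graph model: vertices with labeled out-edges; traversal is a lookup.
def as_graph(triples):
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((p, o))
    return out_edges
```

Under the triple model, a pattern such as `?s livesIn rome` is a filter over the whole collection; under the graph model, one-hop navigation from a vertex is a direct lookup of its labeled out-edges.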
TABLE II. ADDITIONAL CHARACTERISTICS OF THE RDF QUERY PROCESSING APPROACHES.

  System   Query Processing    Query Optimization   Partitioning              SPARQL
  [7]      RDD API             No                   Hash-subj / query aware   BGP+
  [13]     RDD API             Yes                  Vertical                  BGP+
  [19]     Spark SQL API       Yes                  Extended Vertical         BGP+
  [17]     Hybrid              Yes                  Hash-subj                 BGP
  [18]     Graph Iterations    No                   Default                   BGP+
  [16]     Graph Iterations    Yes                  Default                   BGP
  [12]     Graph Iterations    Yes                  Default                   BGP
  [4]      Subgraph Matching   Yes                  Default                   BGP
  [5]      Custom              Yes                  Hash-subj                 BGP

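As a minimal sketch of the two simple schemes appearing in Table II's partitioning column (the function names and the toy hash are our own; the surveyed systems apply these ideas to triples stored in HDFS and shuffled by Spark):

```python
# Toy sketch of vertical and subject-hash partitioning of RDF triples.
from collections import defaultdict

def vertical_partition(triples):
    """Vertical partitioning: one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return tables

def hash_partition_by_subject(triples, n_workers):
    """Subject-hash partitioning: all triples sharing a subject land on the
    same worker, so star-shaped (subject-centric) joins run locally.
    A deterministic toy hash stands in for Spark's partitioner."""
    partitions = [[] for _ in range(n_workers)]
    for s, p, o in triples:
        partitions[sum(map(ord, s)) % n_workers].append((s, p, o))
    return partitions
```

Vertical partitioning turns a triple pattern with a bound predicate into a scan of a single two-column table, which is what [13] and [19] exploit, while subject-hash partitioning keeps subject-centric joins local to one worker.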
Although some recent works have already started to recognize the importance of data partitioning (e.g., [19], which presents a sophisticated partitioning technique based on the classical vertical partitioning method), we argue that data partitioning is an essential part of efficient query processing and that further research is required in the area.

In this direction, by exploiting knowledge about the queries previously submitted to a system, we can derive a more efficient partitioning scheme. The goal of such a scheme would be to efficiently handle the query types that are most frequently submitted to the system, improving its overall efficiency. [7] proposes a partitioning procedure in this direction: specifically, it exploits knowledge about the input queries in order to ensure data locality for frequent queries. Graph partitioning, in contrast, focuses not on load balancing but on minimizing the edge-cut between partitions. GraphX has not been exploited yet in this direction and could be an option for building such algorithms, as it already offers an extensive collection of graph algorithms.

In a different direction, dynamicity is an indispensable characteristic of RDF data, which constantly evolve, typically without any warning, centralized monitoring, or reliable notification mechanism. This raises the need to keep track of the different versions of the data, so as to have access not only to the latest version but also to previous ones. It is therefore crucial for the next generation of parallel RDF query answering approaches to be able to handle evolving data in an uninterrupted manner.

REFERENCES

[1] RDF Schema 1.1. Available online: [Link] (last accessed October 2017)
[2] SPARQL Query Language for RDF. Available online: [Link]
[3] Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., Zaharia, M.: Spark SQL: relational data processing in Spark. In: SIGMOD, pp. 1383–1394 (2015)
[4] Bahrami, R.A., Gulati, J., Abulaish, M.: Efficient processing of SPARQL queries over Graphframes. In: WI, pp. 678–685. ACM (2017)
[5] Chen, X., Chen, H., Zhang, N., Zhang, S.: SparkRDF: Elastic discreted RDF graph processing engine with distributed memory. In: WI-IAT (1), pp. 292–300. IEEE Computer Society (2015)
[6] Christophides, V., Efthymiou, V., Stefanidis, K.: Entity Resolution in the Web of Data. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool Publishers (2015)
[7] Curé, O., Naacke, H., Baazizi, M.A., Amann, B.: HAQWA: a hash-based and query workload aware distributed RDF store. In: International Semantic Web Conference (Posters & Demos), CEUR Workshop Proceedings, vol. 1486. [Link] (2015)
[8] Curé, O., Naacke, H., Baazizi, M.A., Amann, B.: On the evaluation of RDF distribution algorithms implemented over Apache Spark. In: SSWS@ISWC, CEUR Workshop Proceedings, vol. 1457, pp. 16–31. [Link] (2015)
[9] Dave, A., Jindal, A., Li, L.E., Xin, R., Gonzalez, J., Zaharia, M.: Graphframes: an integrated API for mixing graph and relational queries. In: International Workshop on Graph Data Management Experiences and Systems, p. 2 (2016)
[10] Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
[11] Faye, D.C., Curé, O., Blin, G.: A survey of RDF storage approaches. ARIMA Journal 15, 11–35 (2012)
[12] Gombos, G., Rácz, G., Kiss, A.: Spar(k)ql: SPARQL evaluation method on Spark GraphX. In: FiCloud Workshops, pp. 188–193. IEEE Computer Society (2016)
[13] Graux, D., Jachiet, L., Genevès, P., Layaïda, N.: SPARQLGX: efficient distributed evaluation of SPARQL with Apache Spark. In: International Semantic Web Conference (2), Lecture Notes in Computer Science, vol. 9982, pp. 80–87 (2016)
[14] Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL querying of large RDF graphs. PVLDB 4(11), 1123–1134 (2011)
[15] Kaoudi, Z., Manolescu, I.: RDF in the clouds: a survey. VLDB J. 24(1), 67–91 (2015)
[16] Kassaie, B.: SPARQL over GraphX. CoRR abs/1701.03091 (2017)
[17] Naacke, H., Amann, B., Curé, O.: SPARQL graph pattern processing with Apache Spark. In: GRADES@SIGMOD/PODS, pp. 1:1–1:7. ACM (2017)
[18] Schätzle, A., Przyjaciel-Zablocki, M., Berberich, T., Lausen, G.: S2X: graph-parallel querying of RDF with GraphX. In: Big-O(Q)/DMAH@VLDB, Lecture Notes in Computer Science, vol. 9579, pp. 155–168. Springer (2015)
[19] Schätzle, A., Przyjaciel-Zablocki, M., Skilevic, S., Lausen, G.: S2RDF: RDF querying with SPARQL on Spark. PVLDB 9(10), 804–815 (2016)
[20] Sun, W., Fokoue, A., Srinivas, K., Kementsietsidis, A., Hu, G., Xie, G.T.: SQLGraph: an efficient relational-based property graph store. In: SIGMOD, pp. 1887–1901 (2015)
[21] Xin, R.S., Gonzalez, J.E., Franklin, M.J., Stoica, I.: GraphX: a resilient distributed graph system on Spark. In: GRADES (2013)
[22] Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud (2010)
