RDF Querying with Apache Spark Review
Giannis Agathangelos¹, Georgia Troullinou¹, Haridimos Kondylakis¹, Kostas Stefanidis², Dimitris Plexousakis¹
¹ ICS-FORTH, Greece {jagathan, troulin, kondylak, dp}@[Link]
² University of Tampere, Finland [Link]@[Link]
Abstract—The explosion of the Web and the abundance of linked data demand effective and efficient methods for storage, management and querying. More specifically, the ever-increasing size and number of RDF data collections raise the need for efficient query answering and dictate the use of distributed data management systems for effectively partitioning and querying them. In this direction, Apache Spark is one of the most active big-data frameworks, with more and more systems adopting it for efficient, distributed data management. The purpose of this paper is to provide an overview of the existing works dealing with efficient query answering over RDF data using Apache Spark. We discuss the characteristics and the key dimensions of such systems, describe novel ideas in the area along with their drawbacks, and provide directions for future work.

I. INTRODUCTION

The prevalence of Open Linked Data and the explosion of available information on the Web have led to an enormous amount of widely available RDF datasets [6]. To store, manage and query these ever-increasing RDF data, many systems have been developed by the research community and by commercial vendors. In this direction, distributed big-data processing engines, like Hadoop, HBase and Impala [14], are increasingly exploited for this purpose due to their ability to effectively handle massive amounts of data. Apache Spark is one of the most active big-data frameworks, with ever-increasing interest in using it for efficient query answering over RDF data. The platform uses in-memory data structures that can be used to store RDF data, offering increased efficiency and enabling effective distributed query answering.

As such, the goal of this work is to provide an overview of the works dealing with efficient query answering over RDF data using Apache Spark. Focusing on this specific field, we fill a gap in the literature, providing a complete and detailed overview of the current research activities in the area. More specifically, our contributions are the following. First, we present and discuss various dimensions of analysis, identifying key elements of such systems. Then, we classify the approaches according to the data model and the Apache Spark abstraction they use. We proceed further to perform an in-depth overview of the approaches in each category, providing a unique perspective on the research in the area and highlighting the novel ideas and the drawbacks of each one. Finally, we identify what is missing from the area and provide interesting directions for future work.

There are already surveys in the area of generic RDF storage [11] and of RDF data management systems in cloud environments [15]. However, distributed RDF query answering systems are beyond the scope of the former, as its authors claim, and both surveys mainly cover works before the prevalence of Spark. As such, our work can be seen as complementary to the aforementioned surveys, shedding light on the area of RDF query answering, specifically on works using Apache Spark as the underlying data management infrastructure. From a different perspective, [8] presents a preliminary experimental comparison evaluating Spark implementations for RDF systems, focusing on techniques for distributing data. Specifically, the authors analyze five representative RDF data distribution approaches. The purpose there is to provide a clear view of the most efficient distribution solution for a given context among the evaluated solutions, and to show the challenges each approach faces when it comes to a Spark implementation. In our paper, by contrast, we do not attempt a comparative experimental evaluation of a limited number of approaches, but rather identify and present an overview of the main research directions in the area.

The rest of this paper is structured as follows: In Section II, we present some background required when speaking about RDF data. Then, in Section III, we define the dimensions we use for describing the systems presented in Section IV. Finally, Section V concludes this paper, identifies gaps in the area and presents directions for future work.

II. BACKGROUND

A. The Resource Description Framework (RDF)

The representation of knowledge in RDF is based on triples of the form (subject predicate object), which record that subject is related to object via predicate. Formally, the representation of RDF data is based on three disjoint and infinite sets of resources, namely URIs (U), literals (L) and blank nodes (B). RDF allows representing a form of incomplete information through blank nodes, which stand for unknown constants or URIs. As such, a triple is a tuple (subject predicate object) from (U ∪ B) × U × (U ∪ L ∪ B). In addition, to state that a resource r is of a given type, the property rdf:type is used.

RDF datasets have semantics attached through RDFS [1], a vocabulary description language that includes a set of inference rules used to generate new, implicit triples from explicit ones.

Finally, a collection of triples can be represented as a labeled directed graph, in which nodes represent subjects or objects and labeled directed edges represent predicates.
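To make the triple view and its graph interpretation concrete, the following plain-Python sketch (the `ex:` URIs and the data are invented for illustration, and no RDF library is assumed) stores a tiny dataset as (subject, predicate, object) tuples, derives the corresponding labeled directed edges, and uses rdf:type to select the instances of a class:

```python
# A small RDF dataset in the triple model: plain
# (subject, predicate, object) tuples. URIs are abbreviated
# with an "ex:" prefix; the data is invented for illustration.
triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:worksAt", "ex:FORTH"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:knows", "ex:alice"),
]

# Graph view: every triple (s p o) becomes an edge s --p--> o
# in a labeled directed graph whose nodes are subjects and objects.
edges = [(s, o, p) for (s, p, o) in triples]
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}

# rdf:type states that a resource is an instance of a class.
people = [s for (s, p, o) in triples
          if p == "rdf:type" and o == "ex:Person"]
```

A triple-model system operates on `triples` directly, whereas a graph-model system would hand the derived `nodes` and `edges` to a graph API such as GraphX.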
B. Querying

For querying RDF data, SPARQL is used. SPARQL [2] is currently the standard query language for the Semantic Web and an official W3C recommendation. Essentially, SPARQL is a graph-matching language. SPARQL queries contain a set of triple patterns, also called basic graph patterns. Triple patterns are like RDF triples, except that each of the subject, predicate and object may be a variable or a literal. Solutions to the variables are then found by matching the patterns in the query against triples in the dataset. Thus, SPARQL queries are pattern matching queries over the triples that compose an RDF data graph.

Specifically, a SPARQL query consists of three parts. The pattern matching part includes several features of graph pattern matching, like optional parts, union of patterns, nesting, and filtering (or restricting) the values of possible matchings. The solution modifiers, once the output of the pattern has been computed (in the form of a table of values of variables), allow modifying these values by applying classical operators, like projection, distinct, order, limit and offset. Finally, the output of a SPARQL query can be of different types: yes/no answers, selections of values of the variables which match the patterns, construction of new triples from these values, and descriptions of resources.

According to the position of the variables in the triple patterns, a query can have different shapes that affect its performance. Star-shaped patterns/queries are characterized by subject-subject joins between triple patterns, as the join variable is in the subject position. Linear-shaped patterns/queries are made of subject-object (or object-subject) joins; for example, the join variable is in the object position in one triple pattern and in the subject position in the other. Snowflake-shaped patterns/queries are combinations of several star-shaped connections. Finally, more complex queries combine the above described patterns.

III. EVALUATION DIMENSIONS

Apache Spark [22] is an in-memory distributed computing platform designed for large-scale data processing. Spark was originally developed at UC Berkeley in 2009 and is currently one of the most active big-data Apache projects. It can be considered a main-memory extension of the MapReduce model [10], since both enable parallel computations on commodity machines with locality-aware scheduling, fault tolerance and load balancing. Because of Spark's main-memory implementation, it can be up to 100 times faster than Hadoop. This level of efficiency is due to the two main data abstractions that Spark provides: RDDs (Resilient Distributed Datasets) and DataFrames. The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of data elements, partitioned across the nodes of a cluster, that can be operated on in parallel through a low-level API offering transformations and actions. Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational database. By using DataFrames, Spark leverages this schema knowledge and ends up with a much more efficient data encoding than Java serialization.

On top of RDDs and DataFrames, Spark offers two higher-level data access models, GraphX and Spark SQL, for processing semi-structured data in general. These data models can be used to handle RDF data and SPARQL queries. Spark GraphX [21] is a library enabling graph processing by extending the RDD abstraction, thereby introducing a new structure called the Resilient Distributed Graph (RDG). GraphX combines the benefits of graph-parallel and data-parallel systems, as it efficiently expresses graph computations within the framework of the data-parallel system. Spark SQL [3] is Spark's interface for working with structured and semi-structured data. It enables querying data stored in DataFrames using SQL. It also provides an optimizer, Catalyst, which is claimed to improve the execution of queries.

As such, when studying the RDF processing approaches on Apache Spark, the key factors are: a) the data model selected to process the RDF data and b) the Spark data abstraction on which each work decided to rely.

• Data Model: The model selected for the representation of the RDF data. It can be one of the following:
  a. The Triple Model. RDF data are stored and processed in their natural form, as triples that contain subject, predicate and object.
  b. The Graph Model. The RDF data are represented as a directed labeled graph in which, for example, the triple (s hasProperty p) can be interpreted as an edge labeled with hasProperty from node s to node p. This model is used mainly by systems that are built on top of the graph processing API of Spark.
• Apache Spark Abstraction: Spark provides various libraries and data abstractions, each having several advantages and disadvantages.
  a. RDD. RDDs provide a low-level API that gives great control over the dataset. It lacks schema control, but gives greater flexibility when it comes to storage and partitioning, as it offers the choice of implementing a custom partitioner.
  b. DataFrames. A DataFrame is an immutable distributed collection of data that is organized into named columns. It is designed to make processing large datasets even easier, allowing developers to impose a structure onto a distributed collection of data.
  c. Spark SQL. It enables querying structured data stored in DataFrames using SQL and provides an optimizer for improving execution times.
  d. GraphX. This is Spark's library for graph processing. By combining graph-parallel and data-parallel processing, it can achieve great performance and flexibility. It also comes with well-known graph processing algorithms, like PageRank, triangle counting and shortest-path computation.
  e. GraphFrames. This is the newest graph processing API, which benefits from the scalability and high performance of DataFrames. In contrast to GraphX, it also supports queries over graphs. It is not yet an official part of Apache Spark, but comes as a side package.

Figure 1 summarizes the different dimensions based on
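As a concrete illustration of basic graph pattern evaluation, the following sketch matches a star-shaped pattern (a subject-subject join on the variable ?s) against a handful of triples. It is plain, single-machine Python, not the implementation of any surveyed system; the helper names `match` and `evaluate` and the example data are invented for illustration:

```python
# Minimal basic-graph-pattern matcher over a triple list.
# A sketch of SPARQL-style evaluation, not any surveyed system.
triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:worksAt", "ex:FORTH"),
    ("ex:bob", "rdf:type", "ex:Person"),
]

def is_var(term):
    # SPARQL variables start with "?".
    return term.startswith("?")

def match(pattern, triple, binding):
    """Extend `binding` so `pattern` matches `triple`, or return None."""
    b = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if b.get(p_term, t_term) != t_term:
                return None      # variable already bound to another value
            b[p_term] = t_term
        elif p_term != t_term:
            return None          # constant term does not match
    return b

def evaluate(patterns, data):
    """Match the patterns one by one; shared variables act as joins."""
    bindings = [{}]
    for pat in patterns:
        bindings = [b2 for b in bindings for t in data
                    if (b2 := match(pat, t, b)) is not None]
    return bindings

# A star-shaped query: both patterns join on the subject variable ?s.
star = [("?s", "rdf:type", "ex:Person"),
        ("?s", "ex:worksAt", "?org")]
```

Here `evaluate(star, triples)` returns a single binding, mapping ?s to ex:alice and ?org to ex:FORTH. Distributed engines evaluate the same joins over partitioned triple collections, which is why the query shape has a direct impact on performance.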
as graphs, and queries are evaluated directly over them. Either GraphX or GraphFrames is used for query processing.

based on the classical vertical partitioning method), we argue that data partitioning is an essential part of efficient query processing and that further research is required in the area. In this direction, by exploiting knowledge about the queries previously submitted to a system, we can arrive at a more efficient partitioning scheme. The goal of such a scheme would be to efficiently handle the query types that are most often submitted to the system, improving its overall efficiency. [7] proposes a partitioning procedure in this direction. Specifically, it exploits particular knowledge about the input queries in order to ensure data locality for frequent queries. Graph partitioning focuses not on load balancing but rather on minimizing the edge-cut between partitions. GraphX has not yet been exploited in this direction and could be an option for building such algorithms, as it already offers an extensive set of graph algorithms.

In a different direction, dynamicity is an indispensable characteristic of RDF data, which are constantly evolving, typically without any warning, centralized monitoring, or reliable notification mechanism. This raises the need to keep track of the different versions of the data, so as to have access not only to the latest version but also to previous ones. It is thus crucial for the next generation of parallel RDF query answering approaches to be able to handle evolving data in an uninterrupted manner.

REFERENCES

[1] RDF Schema 1.1. Available online: [Link] (last accessed October 2017)
[2] SPARQL Query Language for RDF. Available online: [Link]
[3] Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., Zaharia, M.: Spark SQL: relational data processing in Spark. In: SIGMOD, pp. 1383–1394 (2015)
[4] Bahrami, R.A., Gulati, J., Abulaish, M.: Efficient processing of SPARQL queries over GraphFrames. In: WI, pp. 678–685. ACM (2017)
[5] Chen, X., Chen, H., Zhang, N., Zhang, S.: SparkRDF: Elastic discreted RDF graph processing engine with distributed memory. In: WI-IAT (1), pp. 292–300. IEEE Computer Society (2015)
[6] Christophides, V., Efthymiou, V., Stefanidis, K.: Entity Resolution in the Web of Data. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool Publishers (2015)
[7] Curé, O., Naacke, H., Baazizi, M.A., Amann, B.: HAQWA: a hash-based and query workload aware distributed RDF store. In: International Semantic Web Conference (Posters & Demos), CEUR Workshop Proceedings, vol. 1486. [Link] (2015)
[8] Curé, O., Naacke, H., Baazizi, M.A., Amann, B.: On the evaluation of RDF distribution algorithms implemented over Apache Spark. In: SSWS@ISWC, CEUR Workshop Proceedings, vol. 1457, pp. 16–31. [Link] (2015)
[9] Dave, A., Jindal, A., Li, L.E., Xin, R., Gonzalez, J., Zaharia, M.: GraphFrames: an integrated API for mixing graph and relational queries. In: International Workshop on Graph Data Management Experiences and Systems, p. 2 (2016)
[10] Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
[11] Faye, D.C., Curé, O., Blin, G.: A survey of RDF storage approaches. ARIMA Journal 15, 11–35 (2012)
[12] Gombos, G., Rácz, G., Kiss, A.: Spar(k)ql: SPARQL evaluation method on Spark GraphX. In: FiCloud Workshops, pp. 188–193. IEEE Computer Society (2016)
[13] Graux, D., Jachiet, L., Genevès, P., Layaïda, N.: SPARQLGX: efficient distributed evaluation of SPARQL with Apache Spark. In: International Semantic Web Conference (2), Lecture Notes in Computer Science, vol. 9982, pp. 80–87 (2016)
[14] Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL querying of large RDF graphs. PVLDB 4(11), 1123–1134 (2011)
[15] Kaoudi, Z., Manolescu, I.: RDF in the clouds: a survey. VLDB J. 24(1), 67–91 (2015)
[16] Kassaie, B.: SPARQL over GraphX. CoRR abs/1701.03091 (2017)
[17] Naacke, H., Amann, B., Curé, O.: SPARQL graph pattern processing with Apache Spark. In: GRADES@SIGMOD/PODS, pp. 1:1–1:7. ACM (2017)
[18] Schätzle, A., Przyjaciel-Zablocki, M., Berberich, T., Lausen, G.: S2X: graph-parallel querying of RDF with GraphX. In: Big-O(Q)/DMAH@VLDB, Lecture Notes in Computer Science, vol. 9579, pp. 155–168. Springer (2015)
[19] Schätzle, A., Przyjaciel-Zablocki, M., Skilevic, S., Lausen, G.: S2RDF: RDF querying with SPARQL on Spark. PVLDB 9(10), 804–815 (2016)
[20] Sun, W., Fokoue, A., Srinivas, K., Kementsietsidis, A., Hu, G., Xie, G.T.: SQLGraph: An efficient relational-based property graph store. In: SIGMOD, pp. 1887–1901 (2015)
[21] Xin, R.S., Gonzalez, J.E., Franklin, M.J., Stoica, I.: GraphX: a resilient distributed graph system on Spark. In: GRADES (2013)
[22] Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: Cluster computing with working sets. In: HotCloud (2010)