Practical Apache Spark in GraphX
Bash:
./bin/spark-shell
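The snippets below assume that the GraphX classes are available in the shell; if they have not already been imported in an earlier step, the following two imports bring them in:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD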
The first and second arguments are the identifiers of the source and destination vertices, and the third argument is the edge property, which in our case is the distance between the corresponding cities in kilometers.
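For reference, a graph of this shape can be constructed roughly as follows; the city names, populations, and distances below are only placeholders, and verRDD is the vertex RDD that the later snippets refer to:
// Vertices: (vertex id, (city name, population)); the values are illustrative only
val verRDD: RDD[(VertexId, (String, Int))] = sc.parallelize(Seq(
  (1L, ("CityA", 60000)),
  (2L, ("CityB", 40000)),
  (3L, ("CityC", 120000))))
// Edges: Edge(source id, destination id, distance in kilometers)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 100),
  Edge(2L, 3L, 200)))
val graph = Graph(verRDD, edgeRDD)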
Filtration by vertices
To illustrate filtration by vertices, let's find the cities with a population greater than 50,000. To implement this, we will use the filter operator:
graph.vertices.filter { case (id, (city, population)) => population > 50000 }
  .collect
  .foreach { case (id, (city, population)) => println(s"The population of $city is $population") }
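If the filtered result is needed as a graph rather than as an RDD of vertices, a closely related option is the subgraph operator with a vertex predicate; a minimal sketch:
// Keep only the vertices with population above 50000, plus the edges between them
val bigCities = graph.subgraph(vpred = (id, attr) => attr._2 > 50000)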
Triplets
One of the core functionalities of GraphX is exposed through the triplets RDD. There is one triplet for each edge, and it combines the edge attribute with the attributes of both of its vertices. We can take a look at them with graph.triplets.collect.
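Each triplet exposes srcAttr and dstAttr, which are the (city, population) pairs of the two endpoints, and attr, which is the distance, so a more readable listing can be printed like this:
graph.triplets.collect.foreach { t =>
  println(s"${t.srcAttr._1} -> ${t.dstAttr._1}: ${t.attr} km")
}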
Filtration by edges
Now let's consider another type of filtration, namely filtration by edges. This time we want to find the pairs of cities that are less than 150 kilometers apart. In the Spark shell we can type:
graph.edges.filter { case Edge(city1, city2, distance) => distance < 150 }
  .collect
  .foreach { case Edge(city1, city2, distance) => println(s"The distance between $city1 and $city2 is $distance") }
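Note that city1 and city2 in this output are vertex identifiers, because edges only carry the ids of their endpoints. To print the city names instead, the same filter can be expressed over triplets:
graph.triplets.filter(t => t.attr < 150)
  .collect
  .foreach(t => println(s"The distance between ${t.srcAttr._1} and ${t.dstAttr._1} is ${t.attr}"))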
Aggregation
Another interesting task to consider here is aggregation. We will find, for each city, the total population of its neighboring cities. Before we start, we need to change our graph a little. The reason is that GraphX deals only with directed graphs, so to take edges in both directions into account we add the reverse directions to the graph by taking a union of the reversed edges and the original ones.
val undirectedEdgeRDD = graph.reverse.edges.union(graph.edges)
val graph = Graph(verRDD, undirectedEdgeRDD)
Now we have an undirected graph with the edges in both directions taken into account, so we can perform the aggregation using the aggregateMessages operator:
// For every edge, send the population of the destination city to the source vertex,
// then sum up the messages received at each vertex
val neighbors = graph.aggregateMessages[Int](
  ectx => ectx.sendToSrc(ectx.dstAttr._2),
  _ + _)
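The result, neighbors, is a VertexRDD[Int] keyed by vertex id. To display it together with the city names, it can be joined back to the vertex RDD, for example:
graph.vertices.join(neighbors).collect.foreach {
  case (id, ((city, population), neighborPopulation)) =>
    println(s"The total population of the neighbors of $city is $neighborPopulation")
}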
Conclusions
GraphX is a very useful Spark component with applications in many different fields, from computer science to biology and the social sciences. In this post, we have considered a simple graph example where the vertices are cities and the edge properties are the distances between them. Some basic operations, such as filtration by vertices, filtration by edges, operations with triplets, and aggregation, have been applied to this graph. All in all, we have shown that the Apache Spark GraphX component is convenient and well suited for graph computations.