<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>hadoop-gremlin</artifactId>
<version>x.y.z</version>
</dependency>
Hadoop is a distributed
computing framework that is used to process data represented across a multi-machine compute cluster. When the
data in the Hadoop cluster represents a TinkerPop graph, then Hadoop-Gremlin can be used to process the graph
using both TinkerPop’s OLTP and OLAP graph computing models.
Important
This section assumes that the user has a functioning Hadoop 3.x cluster. For more information on getting
started with Hadoop, please see the Single Node Setup tutorial. Moreover, if using SparkGraphComputer, it is
advisable that the reader also familiarize themselves with Spark (Quick Start).
If using Gremlin Console, it is important to install the Hadoop-Gremlin plugin. Note that Hadoop-Gremlin requires a Gremlin Console restart after installing.
$ bin/gremlin.sh
\,,,/
(o o)
-----oOOo-(4)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> :install org.apache.tinkerpop hadoop-gremlin x.y.z
==>loaded: [org.apache.tinkerpop, hadoop-gremlin, x.y.z] - restart the console to use [tinkerpop.hadoop]
gremlin> :q
$ bin/gremlin.sh
\,,,/
(o o)
-----oOOo-(4)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> :plugin use tinkerpop.hadoop
==>tinkerpop.hadoop activated
gremlin>
It is important that the CLASSPATH environment variable references HADOOP_CONF_DIR and that the configuration
files in HADOOP_CONF_DIR contain references to a live Hadoop cluster. It is easy to verify a proper
configuration from within the Gremlin Console. If hdfs references the local file system, then there is a
configuration issue.
gremlin> hdfs
==>storage[org.apache.hadoop.fs.LocalFileSystem@65bb9029] // BAD
gremlin> hdfs
==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1229457199_1, ugi=user (auth:SIMPLE)]]] // GOOD
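Assuming a typical installation, the environment can be prepared before launching the console as follows (the Hadoop installation path below is a hypothetical example and will differ per installation):
$ export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
$ export CLASSPATH=$CLASSPATH:$HADOOP_CONF_DIR
$ bin/gremlin.sh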
The HADOOP_GREMLIN_LIBS environment variable references locations that contain jars that should be uploaded to
a respective distributed cache (YARN or SparkServer). Note that the locations in HADOOP_GREMLIN_LIBS can be
colon-separated (:) and all jars from all locations will be loaded into the cluster. Locations can be local
paths (e.g. /path/to/libs), but may also be prefixed with a file scheme to reference files or directories in
different file systems (e.g. hdfs:///path/to/distributed/libs). Typically, only the jars of the respective
GraphComputer are required to be loaded.
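For example, to have the Hadoop-Gremlin and SparkGraphComputer jars uploaded to the cluster, one might set the variable as follows (the colon-separated paths below are illustrative examples and will differ per installation):
$ export HADOOP_GREMLIN_LIBS=/usr/local/gremlin-console/ext/hadoop-gremlin/lib:/usr/local/gremlin-console/ext/spark-gremlin/lib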
HadoopGraph makes use of properties files which ultimately get turned into Apache configurations and/or
Hadoop configurations.
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.inputLocation=tinkerpop-modern.kryo
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.outputLocation=output
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
####################################
# Spark Configuration #
####################################
spark.master=local[4]
spark.executor.memory=1g
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
gremlin.spark.persistContext=true
A review of the Hadoop-Gremlin specific properties is provided in the table below. For the respective OLAP
engines (e.g. SparkGraphComputer), refer to their respective documentation for configuration options.
Property | Description
---|---
gremlin.graph | The class of the graph to construct using GraphFactory.
gremlin.hadoop.inputLocation | The location of the input file(s) for Hadoop-Gremlin to read the graph from.
gremlin.hadoop.graphReader | The class that the graph input file(s) are read with (e.g. an InputFormat).
gremlin.hadoop.outputLocation | The location to write the computed HadoopGraph to.
gremlin.hadoop.graphWriter | The class that the graph output file(s) are written with (e.g. an OutputFormat).
gremlin.hadoop.jarsInDistributedCache | Whether to upload the Hadoop-Gremlin jars to a distributed cache (necessary if jars are not on the machines' classpaths).
gremlin.hadoop.defaultGraphComputer | The default GraphComputer to use when graph.compute() is called.
Along with the properties above, the numerous Hadoop specific properties can be added as needed to tune and parameterize the executed Hadoop-Gremlin job on the respective Hadoop cluster.
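For example, standard Hadoop and Spark properties can be mixed into the same properties file alongside the gremlin.hadoop.* properties (the values below are illustrative only):
fs.defaultFS=hdfs://namenode:8020
mapreduce.job.maps=8
spark.executor.cores=2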
Important
As the size of the graphs being processed becomes large, it is important to fully understand how the underlying OLAP engine (e.g. Spark, etc.) works and to understand the numerous parameterizations offered by these systems. Such knowledge can help alleviate out-of-memory exceptions, slow load times, slow processing times, garbage collection issues, etc.
It is possible to execute OLTP operations over a HadoopGraph. However, realize that the underlying HDFS files
are not random access and thus, to retrieve a vertex, a linear scan is required. OLTP operations are useful for
peeking into the graph prior to executing a long-running OLAP job, e.g. g.V().valueMap().limit(10).
Warning
OLTP operations on HadoopGraph are not efficient. They require linear scans to execute and are unreasonable
for large graphs. In such large graph situations, make use of TraversalVertexProgram, which is the OLAP
Gremlin machine.
hdfs.copyFromLocal('data/tinkerpop-modern.kryo', 'tinkerpop-modern.kryo')
hdfs.ls()
graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
g = traversal().with(graph)
g.V().count()
g.V().out().out().values('name')
g.V().group().by{it.value('name')[1]}.by('name').next()
Hadoop-Gremlin was designed to execute OLAP operations via GraphComputer. The OLTP examples presented
previously are reproduced below, but using TraversalVertexProgram for the execution of the Gremlin traversal.
A Graph in TinkerPop can support any number of GraphComputer implementations. Out of the box, Hadoop-Gremlin
supports the following implementation.
- SparkGraphComputer: Leverages Apache Spark to execute TinkerPop OLAP computations. The graph may fit within
  the total RAM of the cluster (supports larger graphs). Message passing is coordinated via Spark
  map/reduce/join operations on in-memory and disk-cached data (average speed traversals).
Tip
For those wanting to use the sugar plugin, be sure to :remote config useSugar true as well as
:plugin use tinkerpop.sugar at the start of the Gremlin Console session if it is not already activated.
$ bin/gremlin.sh
\,,,/
(o o)
-----oOOo-(4)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
plugin activated: tinkerpop.hadoop
gremlin> :install org.apache.tinkerpop spark-gremlin x.y.z
==>loaded: [org.apache.tinkerpop, spark-gremlin, x.y.z] - restart the console to use [tinkerpop.spark]
gremlin> :q
$ bin/gremlin.sh
\,,,/
(o o)
-----oOOo-(4)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
plugin activated: tinkerpop.hadoop
gremlin> :plugin use tinkerpop.spark
==>tinkerpop.spark activated
Warning
Hadoop and Spark both depend on many of the same libraries (e.g. ZooKeeper, Snappy, Netty, Guava, etc.).
Unfortunately, these dependencies are typically not pinned to the same versions of the respective libraries.
As such, it may be necessary to manually clean up dependency conflicts among different plugins.
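With tinkerpop.spark activated, the earlier OLTP traversals can be reproduced in OLAP form. The following is a minimal sketch that assumes the hadoop-gryo.properties file shown earlier; the traversal is compiled into a TraversalVertexProgram and executed by SparkGraphComputer:
graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
g = traversal().with(graph).withComputer(SparkGraphComputer)
g.V().count()
g.V().out().out().values('name')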