Open Source Big Graph Analytics
on Neo4j with Apache Spark
Kenny Bastani (@kennybastani)
Speaker Introduction
§ Graph database enthusiast
§ Building microservice architectures
§ Lead developer at Digital Insight
§ Co-author of upcoming O’Reilly book — Spring Boot Essentials
The Problem
It's hard to analyze graphs at scale
The importance of graph algorithms
§ PageRank gave us Google
§ Friend of a friend gave us Facebook
§ Collaborative filtering makes Netflix recommendations awesome
Why is it so hard to do this stuff?
Enemy #1:
Relational databases store data in ways that make it difficult to extract graphs for analysis
Enemy #2:

If you still think Big Data is a buzzword
You haven’t had to feel the pain of failing at it.
When you hit a wall because your data is too big
You start to see what this big data thing is all about.
Distributed File Systems
Distributed file systems are a foundational component of big data analytics
Chops files into manageably sized blocks, usually 64 MB
Spreads those blocks out across a cluster of VM resources
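As a rough illustration of the two steps above, here is a minimal sketch that splits a file into 64 MB blocks and spreads them over a cluster. The node names and the round-robin placement are assumptions made for the example; real distributed file systems also replicate blocks and use smarter placement.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the figure quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_blocks(blocks, nodes):
    """Assign blocks to nodes round-robin (a simplification, no replication)."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement


# A 200 MB file becomes three full blocks plus one 8 MB remainder.
blocks = split_into_blocks(200 * 1024 * 1024)
# Hypothetical node names, chosen only for illustration.
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"])
```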
Hadoop MapReduce
Worth mentioning: Hadoop kicked off this whole open source distributed MapReduce thing
You translate raw data, say from a CSV, into a map of keys to values
Keys are distributed across nodes, and each node reduces the values for its keys into a partitioned result
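The map, distribute, reduce flow can be sketched in plain Python on a single machine. This is not Hadoop's API; the CSV contents, the column names, and the function names are all made up for the example, and the `shuffle` step stands in for routing each key to a node.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input: one "follows" relationship per row.
RAW_CSV = """follower,followed
alice,bob
carol,bob
alice,carol
"""


def map_phase(rows):
    """Emit one (key, value) pair per row: (followed user, 1)."""
    for row in rows:
        yield row["followed"], 1


def shuffle(pairs):
    """Group values by key; in a cluster each key would be routed to one node."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


rows = csv.DictReader(io.StringIO(RAW_CSV))
follower_counts = reduce_phase(shuffle(map_phase(rows)))
print(follower_counts)  # {'bob': 2, 'carol': 1}
```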
Graph algorithms can be evil at scale
It depends on the complexity of your graph
How many strongly connected components you have
But since some graph algorithms like PageRank are iterative
You have to iterate from one stage and use the results of the previous stage
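To make the iteration concrete, here is a minimal single-machine PageRank sketch. The damping factor of 0.85, the fixed iteration count, and the tiny edge list are assumptions for illustration; this is not the distributed implementation discussed in the talk.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Iterative PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Each pass is computed only from the previous pass's ranks.
        next_ranks = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * ranks[src] / len(targets)
            for dst in targets:
                next_ranks[dst] += share
        ranks = next_ranks
    return ranks


# Hypothetical four-node graph; "d" has no incoming links.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
```

Every pass needs the full rank table from the pass before it, which is why the whole working set has to stay available between stages.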
It doesn't matter how many nodes you have in your cluster
For iterative graph algorithms, the complexity of the graph will make you or break you
Graphs with high complexity need a lot of memory to be processed iteratively
Neo4j Mazerunner Project
2-way Graph ETL
What is Neo4j Mazerunner?
The basic idea is…
Graph databases need ETL so you can analyze your data and look it up later.
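A minimal sketch of the round trip, assuming an in-memory dictionary as a stand-in for the graph database and in-degree as a placeholder analysis. None of these names come from Neo4j or Mazerunner; they only illustrate the extract, analyze, write-back shape.

```python
from collections import Counter

# Hypothetical stand-in for a graph database: node properties plus relationships.
graph_store = {
    "nodes": {"a": {}, "b": {}, "c": {}},
    "relationships": [("a", "b"), ("c", "b"), ("b", "c")],
}


def extract(store):
    """Export relationships as a flat edge list an analytics job can read."""
    return list(store["relationships"])


def analyze(edge_list):
    """Placeholder analysis: count incoming relationships per node."""
    return Counter(dst for _, dst in edge_list)


def load(store, results, property_name):
    """Write each result back onto its node so it can be looked up later."""
    for node_id, props in store["nodes"].items():
        props[property_name] = results.get(node_id, 0)


# Out of the graph, through the analysis, and back in as a node property.
load(graph_store, analyze(extract(graph_store)), "in_degree")
```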
Docker
If you’re not up on Docker, let me give you a quick intro.
Docker
Docker is a container framework that lets you easily create a recipe for an image and deploy applications with ease.
The idea is that infrastructure and operational complexity make agile development of new products hard.
Why?
If I am an engineer on a product team, I want to choose my own software libraries and languages to solve problems.
Microservices and Docker
§ If you want to build a new service, use whatever application framework you want, as long as you communicate over REST.
§ Docker gives you the freedom to use Neo4j, MySQL, MongoDB, or whatever application dependency you want inside your container.
Docker Compose
Docker Compose allows you to run multi-container applications
It uses a single YAML file
PageRank
Distributed PageRank
Questions?
kennybastani.com
