© 2014 IBM Corporation
Advanced Data Retrieval and Analytics
with Apache Spark and OpenStack Swift
Gil Vernik
IBM Research - Haifa
Topics Covered in This Talk
§ OpenStack Swift
§ Apache Spark
§ Basic integration between Spark and Swift
§ Advanced integration between Spark and Swift
using Storlets technology
Digital Universe
More than 1.8 zettabytes
(1.8 trillion gigabytes)
Grows rapidly
80% owned by enterprises
75% generated by individuals
According to the IDC iView "Extracting Value from Chaos"
MapReduce, databases, etc.
Data needs to be replicated, which costs time, money, etc.
Can we do it better?
OpenStack Swift
§ A massively scalable object store
§ Known to run on thousands of
servers and store petabytes of data
§ Exposes a REST API
§ Features:
– Storage policies
– Erasure codes
– Data replication
– ….
[Diagram: a PUT request arrives at the proxy nodes, which pass it to the storage nodes]
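As a rough illustration of the REST API mentioned above, the sketch below builds (but does not send) an object PUT request using only the Python standard library. The endpoint, account, container, object name, and token are hypothetical placeholders; the path layout /v1/{account}/{container}/{object} and the X-Auth-Token header follow the usual Swift conventions.

```python
from urllib.parse import quote
from urllib.request import Request


def build_put_request(endpoint, account, container, obj, token, data):
    """Build (but do not send) a Swift object PUT request."""
    # Swift object paths have the form /v1/{account}/{container}/{object}.
    path = "/v1/{}/{}/{}".format(quote(account), quote(container), quote(obj))
    return Request(
        url=endpoint.rstrip("/") + path,
        data=data,
        method="PUT",
        headers={"X-Auth-Token": token},
    )


# Hypothetical values, for illustration only.
req = build_put_request(
    "http://swift.example.com:8080", "AUTH_demo", "photos", "img001.jpg",
    token="<token>", data=b"...jpeg bytes...",
)
print(req.get_method(), req.full_url)
```

Sending the request (for example with urllib.request.urlopen) would need a reachable Swift proxy and a valid token, so the sketch stops at constructing it.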
Apache Spark
§ Apache Spark™ is a fast and general engine for
large-scale data processing
– Up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk
§ Combines SQL, streaming, and complex analytics
§ Can read existing Hadoop data
§ Most active project in Apache today
Swift enablement for data retrieval in Spark
§ Apache Spark implements the Hadoop interfaces and can use
HDFS or Amazon S3 as a data source.
[Diagram labels: Swift, Network]
§ IBM Research enabled Spark to access data stored in
OpenStack Swift.
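A minimal sketch of what this integration can look like from the configuration side, assuming the Hadoop OpenStack (hadoop-openstack) Swift connector is on the classpath. That connector addresses objects as swift://container.PROVIDER/path and reads its settings from fs.swift.service.PROVIDER.* properties. The provider name, credentials, and paths below are hypothetical placeholders.

```python
def swift_hadoop_conf(provider, auth_url, tenant, username, password):
    """Return Hadoop properties for one Swift provider (assumed connector keys)."""
    prefix = "fs.swift.service.{}.".format(provider)
    return {
        "fs.swift.impl":
            "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem",
        prefix + "auth.url": auth_url,
        prefix + "tenant": tenant,
        prefix + "username": username,
        prefix + "password": password,
    }


def swift_uri(container, provider, path):
    """Build the URI that Spark would be given as its input path."""
    return "swift://{}.{}/{}".format(container, provider, path.lstrip("/"))


# Hypothetical values, for illustration only.
conf = swift_hadoop_conf(
    "demo", "http://keystone.example.com:5000/v2.0/tokens",
    tenant="analytics", username="spark", password="<secret>",
)
uri = swift_uri("logs", "demo", "2014/app.log")
print(uri)
for key in sorted(conf):
    print(key)
```

With these properties in core-site.xml or the Spark configuration, the URI would be passed to Spark like any other Hadoop path, for example sc.textFile(uri).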
What do we analyze?
[Diagram labels: Swift, Network]
Stored data    Input to analytics
Images         EXIF metadata
PDFs           Hidden metadata
Logs           Only ‘ERROR’ records
....           ....
Yes! We can do it better.
Storlets: Flexibly extending Swift
Advanced data processing inside Swift
§ Storlets are a way to ‘extend’
the cloud's computational capabilities
§ A storlet is compiled code
deployed to Swift; when
triggered, it is executed by the storlet
engine directly on the storage
nodes.
§ The storlet engine is responsible for
executing every storlet in a secure
environment
§ A storlet is standard Java code
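To make the idea concrete, here is a conceptual sketch of the kind of stream-in, stream-out filter a storlet could implement, using the "only ERROR records" case as the example. Real storlets are Java code run by the storlet engine; this Python function, its name, and its parameters are hypothetical and do not reflect the actual storlet API.

```python
import io


def error_filter(in_stream, out_stream, params):
    """Copy only the lines containing the marker from in_stream to out_stream."""
    marker = params.get("marker", "ERROR").encode("utf-8")
    kept = 0
    for line in in_stream:
        if marker in line:
            out_stream.write(line)
            kept += 1
    return kept


# A small in-memory "object" standing in for a stored log file.
log = io.BytesIO(
    b"INFO start\nERROR disk full\nINFO retry\nERROR timeout\n"
)
out = io.BytesIO()
kept = error_filter(log, out, {})
print(kept, "records kept,",
      len(out.getvalue()), "of", len(log.getvalue()), "bytes returned")
```

The point of running such a filter on the storage node is that only the bytes written to the output stream would have to travel over the network to the analytics cluster.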
Storlets extend an object store by
moving computation to the data –
filtering, transforming, analyzing –
instead of bringing the data to the
computation
Swift Storlets: How do they benefit Spark?
[Diagram labels: Swift, Storlet, Objects, Filter, Data processing, Network]
Storlets Enable Extending the Functionality of Spark
Example: analyzing EXIF metadata from photos
§ Object store is a
natural repository for
photos
§ Photos contain rich
capture metadata
§ Analyzing this
metadata for a set of
photos can show how
the camera is used
Example: Analyzing EXIF metadata
Storlets can extract the metadata and return it as JSON
(rather than having Spark process the binary data directly)
[Figure: a 10 MB photo is reduced to 1 KB of metadata]
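The 10 MB and 1 KB figures above translate into a simple back-of-the-envelope calculation. The sketch below assumes decimal units and a hypothetical collection of 1,000 photos; both are illustration choices, not numbers from the talk.

```python
KB = 1000          # decimal units, assumed for simplicity
MB = 1000 * KB

photo_bytes = 10 * MB      # size of one stored photo (from the slide)
metadata_bytes = 1 * KB    # size of the JSON returned by the storlet

# How many times less data crosses the network per photo.
factor = photo_bytes // metadata_bytes

# Hypothetical collection size, for illustration only.
num_photos = 1000
without_storlet = num_photos * photo_bytes
with_storlet = num_photos * metadata_bytes

print("reduction factor:", factor)
print("transferred without storlet (MB):", without_storlet // MB)
print("transferred with storlet (MB):", with_storlet // MB)
```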
Example: Analyzing EXIF metadata.
•  Spark accesses the images via a storlet
•  No change to Spark; only the URI changes
•  The JSON returned by the storlet defines the schema
•  SQL from Spark processes the metadata
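The last two bullets can be sketched without a Spark cluster. In the example below, Python's built-in sqlite3 module stands in for Spark SQL (a deliberate substitution, not what the talk used), the table schema is derived from the keys of the JSON records, and a SQL query summarizes camera usage. The field names and values are hypothetical.

```python
import json
import sqlite3

# Hypothetical JSON records, one per photo, as a storlet might return them.
storlet_output = [
    '{"file": "img001.jpg", "model": "Camera A", "iso": 100}',
    '{"file": "img002.jpg", "model": "Camera B", "iso": 400}',
    '{"file": "img003.jpg", "model": "Camera A", "iso": 800}',
]
rows = [json.loads(record) for record in storlet_output]

# The JSON keys define the schema, as on the slide.
columns = sorted(rows[0])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos ({})".format(", ".join(columns)))
conn.executemany(
    "INSERT INTO photos VALUES ({})".format(", ".join("?" for _ in columns)),
    [[row[col] for col in columns] for row in rows],
)

# SQL over the metadata: how many photos were taken with each camera model?
result = conn.execute(
    "SELECT model, COUNT(*) FROM photos GROUP BY model ORDER BY model"
).fetchall()
print(result)
```

In Spark the same flow would use Spark SQL's JSON support to infer the schema and then run the query against the resulting table.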
Example: Analyzing EXIF metadata.
Summary
§ OpenStack Swift is the most popular open-source
object store
§ Apache Spark is the next big thing in data analytics
§ Spark and Swift can be integrated
§ Storlets in Swift provide clear benefits for analytics
use cases.
Thank you!
More information
Gil Vernik, IBM Research - Haifa
gilv@il.ibm.com
