Big Data Open Source Tools and Trends: Enable Real-
Time Business Intelligence from Machine Logs
Eric Roch, Principal &
Ben Hahn, Senior Technical Architect
Perficient is a leading information technology consulting firm serving clients throughout
North America.
We help clients implement business-driven technology solutions that integrate business
processes, improve worker productivity, increase customer loyalty and create a more agile
enterprise to better respond to new business opportunities.
About Perficient
• Founded in 1997
• Public, NASDAQ: PRFT
• 2013 revenue $373 million
• Major market locations throughout North America
• Atlanta, Boston, Charlotte, Chicago, Cincinnati, Columbus,
Dallas, Denver, Detroit, Fairfax, Houston, Indianapolis,
Los Angeles, Minneapolis, New Orleans, New York City,
Northern California, Philadelphia, Southern California,
St. Louis, Toronto and Washington, D.C.
• Global delivery centers in China, Europe and India
• >2,100 colleagues
• Dedicated solution practices
• ~90% repeat business rate
• Alliance partnerships with major technology vendors
• Multiple vendor/industry technology and growth awards
Perficient Profile
BUSINESS SOLUTIONS
Business Intelligence
Business Process Management
Customer Experience and CRM
Enterprise Performance Management
Enterprise Resource Planning
Experience Design (XD)
Management Consulting
TECHNOLOGY SOLUTIONS
Business Integration/SOA
Cloud Services
Commerce
Content Management
Custom Application Development
Education
Information Management
Mobile Platforms
Platform Integration
Portal & Social
Our Solutions Expertise
Eric Roch
Principal
Eric leads Perficient's
national connected solutions
practice
• Includes focus on SOA/integration,
cloud, mobile and Big Data
• Author & industry speaker
• 25+ years of experience in various
aspects of information technology
including:
• Executive-level management
• Enterprise architecture
• Application development
Speakers
Ben Hahn
Sr. Technical Architect
Ben Hahn is a Sr.
Technical Architect
• Includes focus on transactions, logging &
exception processing
• Author & speaker
• 20+ years of experience in various
aspects of information technology
including:
• Software solutions
• Enterprise infrastructure
• Product management
• Open Source software community
contributor
• Often defined as data that exceeds the capacities of
conventional database systems because it’s too large
and moves too fast for those systems to
handle in an architecturally cohesive way. The three V’s
of Big Data are:
• Volume
• Most companies have 100 TB of data
• Facebook ingests 500 TB in a single day
• 40 zettabytes (that’s 43 trillion GB) of data by
2020
• Velocity
• NYSE captures 4-5 TB of data in a single day
• A Boeing 737 generates 243 TB in a single flight
• The Google self-driving car generates 750MB of
data per second!
• Variety
• Twitter, Clickstreams, Audio, Video
• GPS, Sensor data, Facebook content
• Infrastructure and application logs
What is Big Data?
POLL QUESTION:
What is your current adoption level for big data?
• Evaluation
• Prototype
• Production
But Not Everyone is Google!
Where’s the Big Data coming from?
POLL QUESTION
Have you used open source software for big data solutions?
• Yes
• No
Machine Data definitely has the three V’s of Big Data
Machine Data is Big Data
What Can We Gain From Machine Data?
Valuable information can be mined from
machine data, including:
• Transaction monitoring
• Error detection
• Behavior trends
• Audit logging
• Infrastructure states
• Anomaly detection
• Geospatial analysis
• Network analysis
Log Analysis vs. Business Analytics
• Ingest - Versus ETL
• Big Data - Bidirectional integration with Hadoop
• Query language - MapReduce function on unstructured
data
• Drill anywhere - Investigate on all the data versus a
predefined schema or cube
• Information discovery - Discover relationships based on
patterns in the data
• Ad hoc versus dimensional - Log analysis is not tied to a
predefined structure built from a point-in-time set of
requirements
• Explicit logging - Versus implicit correlation
Polling Question:
Do you mine machine data for business 
insights?
• Yes
• No
Innovations From Cloud and OSS
• Hadoop and MapReduce - Derived from Google's
MapReduce and Google File System
• Storm – Distributed event processor open sourced by
Twitter
• Presto - Facebook has released as open source a SQL
query engine built to work with petabyte-sized data
warehouses
• Google BigQuery - Run SQL-like queries against terabytes
of data in seconds
• Amazon DynamoDB - NoSQL database service to store
and retrieve any amount of data, and serve any level of
request traffic
• Elasticsearch – Distributed full-text search engine from the OSS community
POLLING QUESTION
Do you plan to use cloud based solutions for 
big data?
• Yes
• No
• 2004 - Google published a paper called MapReduce: Simplified Data
Processing on Large Clusters characterized by:
• Map and shuffle key-value data pairs and then aggregate/reduce these
intermediate data pairs
• Origins in map and reduce primitives in functional languages
• Massive parallelism and elasticity via commodity hardware
• Fault tolerance via master-worker nodes
Big Data Processing: MapReduce
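The map/shuffle/reduce flow described in the paper can be sketched in a few lines of plain Python. The word count below is the canonical illustration only, with no framework assumed:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (key, value) pairs -- here (word, 1) for every word."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's list of values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

logs = ["error timeout", "error retry", "ok"]
counts = reduce_phase(shuffle_phase(map_phase(logs)))
# counts == {"error": 2, "timeout": 1, "retry": 1, "ok": 1}
```

In a real cluster the map and reduce calls run in parallel on many worker nodes; only the shuffle requires coordination.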
• Based on Lambda (λ) calculus
• ALL computational functions and data can be expressed as
a series of functions and predicates of functions
• Declarative language rather than imperative
• First-class functions – Functions can be passed just like
values as arguments and returned as values. This also
allows currying and partial application.
• Call by name – Function expressions are not evaluated
until they are actually used.
• Recursion – A function can be defined in terms of itself,
replacing explicit loops.
• Immutable state and values – Pure functional programming
has no mutable variables, only immutable values as they
appear at a given moment in time. This has big effects on
scalability and concurrency.
• Referential Transparency - Functions can be replaced by
their values with no side effects.
• Pattern matching – Data type matching as well as data
structure composition and deep object type matching
• Erlang, Haskell, Lisp, Clojure, Scala
What are functional languages?
And MapReduce is Better with
Functional Languages
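Python is not a pure functional language, but it can demonstrate several of the concepts listed above in a minimal sketch:

```python
from functools import partial, reduce

# First-class functions: pass a function as an argument, just like a value.
def apply_twice(f, x):
    return f(f(x))

seven = apply_twice(lambda n: n + 3, 1)  # f(f(1)) == 7

# Currying / partial application: fix one argument, get a new function back.
def add(a, b):
    return a + b

increment = partial(add, 1)  # increment(41) == 42

# Recursion: the function is defined in terms of itself, replacing a loop.
def length(xs):
    return 0 if not xs else 1 + length(xs[1:])

# Immutability: tuples cannot be modified; updates derive new values instead.
point_in_time = (("10:01", "ok"), ("10:02", "error"))
updated = point_in_time + (("10:03", "ok"),)  # the original is untouched

# Referential transparency: this whole expression can be replaced by its value.
total = reduce(add, [1, 2, 3], 0)  # 6
```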
Imperative Model: Pascal, C, Basic, etc.
Evolution (or Devolution?) of Databases
Object-Oriented Programming Model: Java,
C++, C#
Evolution (or Devolution?) of Databases
Functional Programming Model:
Scala, Clojure, F#
Evolution (or Devolution?) of Databases
• Because commodity hardware in the cloud is highly
elastic, the resources needed to query and run transactions
can be scaled in response to the data volumes at the
store level.
• Data is stored using the functional programming concept
of immutability, by only appending data as point-in-time
values.
• MapReduce functions can be balanced and distributed
across machines as nodes fail or new nodes are added.
• First-class functions and call by name allow functions and
lambda expressions to be passed into MapReduce calls
as arguments, so ad hoc functionality can be added.
• Pattern matching allows very complex pattern matches
on complex structures like XML.
• Transactions use functional expressions like
compare-and-swap operations to ensure ACID properties.
• SQL or query expressions can be reduced to
MapReduce functions or lambda expressions and/or
patterns and distributed in parallel across the nodes.
• Using recursion, complex structures like XML can be
mapped and reduced from a single expression.
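The append-only, point-in-time storage idea above can be illustrated with a toy store. `AppendOnlyLog` and its method names are hypothetical, a sketch of the concept rather than any real product's API:

```python
class AppendOnlyLog:
    """Append-only store: values are never updated in place. Each write is
    a new (version, key, value) entry, so any point in time can be read back."""

    def __init__(self):
        self._entries = []  # only ever appended to, never mutated in place

    def append(self, key, value):
        version = len(self._entries) + 1
        self._entries.append((version, key, value))
        return version

    def value_at(self, key, version):
        """Read a key as it appeared at a given version (point-in-time view)."""
        current = None
        for v, k, val in self._entries:
            if v <= version and k == key:
                current = val
        return current

log = AppendOnlyLog()
log.append("status", "ok")          # version 1
v2 = log.append("status", "error")  # version 2
log.value_at("status", 1)           # 'ok': history survives, nothing overwritten
```

Because entries are immutable, writes need no locks on existing data, which is what makes append-only models attractive for fast, concurrent log ingestion.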
MapReduce Machine Data:
What Do We Need?
• A dynamic process for parsing
and mapping unstructured data
to structured data in real-time
• Wide range of data formats
(text, XML, JSON, CSV, EDI,
etc.)
• Need intelligent pattern
matching capabilities
• Ability to correlate meaningful
transactional data and metrics
from disparate data (reducing)
• Machine data is static and
immutable. Append-only fast
writes with eventual
consistency are ideal
• Need fast filter, search, query
capabilities to display results
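The first two needs, dynamic parsing and pattern matching of raw text into structured records, can be sketched with a regular expression as the map step. The log format and field names here are hypothetical; a real mapping layer would carry one pattern per data source:

```python
import re

# Hypothetical log layout: timestamp, level, service, free-text message.
LOG_PATTERN = re.compile(
    r'(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)'
)

def map_log_line(line):
    """Map one raw, unstructured log line to a structured record (or None)."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

raw = "2014-04-01T10:02:11Z ERROR payments connection timed out"
record = map_log_line(raw)
# record["level"] == "ERROR", record["service"] == "payments"
```

Lines that fail the pattern return `None`, so unmatched data can be routed to a fallback parser instead of being silently dropped.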
Open Source Big Data Landscape
Source: www.bigdata-startup.com
Apache Hadoop: The Elephant in the Room
• What about Apache Hadoop?
• Apache Hadoop comprises HDFS and
Hadoop MapReduce, both based on Google’s GFS
and MapReduce
• Batch-oriented MapReduce jobs run through
schedulers and JobTrackers
• We require real-time MapReduce processing
• We need to index, query and search data in real time
through a well-defined interface
• We can use Hadoop for secondary storage of long-term
persistent logs – the Lambda Architecture (batch vs.
speed layer)
Apache Storm: Use Real-time
MapReduce for Machine Data Streams
• Developed by BackType, which was acquired by Twitter
• Distributed computational framework that allows real-
time MapReduce functionality over any data source
stream, using the concepts of Spouts and Bolts
• Read From Any Data Stream using Spouts (Kafka,
JMS, HTTP, etc.)
• Transactional and guaranteed message processing
• Parallelism and scalability
• Fault Tolerance (Master-Worker for MapReduce)
• MapReduce Topologies
• Offers real-time MapReduce jobs (or Bolts)
• Other tools: Apache Spark
Apache Storm: Use Real-time
MapReduce for Machine Data Streams
MapReduce in Storm - the declarative style and simplicity of functional
languages
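The spout/bolt idea can be simulated in plain Python. This is not the Storm API; generators stand in for Storm's distributed streams, only to mimic the shape of a topology:

```python
from collections import Counter

def sentence_spout(lines):
    """Spout: the source of the stream -- emits one tuple per input line."""
    for line in lines:
        yield line

def split_bolt(stream):
    """Bolt (map step): splits each sentence tuple into word tuples."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt (reduce step): maintains rolling counts as tuples arrive."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

# Wiring the pieces together mirrors a topology: spout -> bolt -> bolt.
topology = count_bolt(split_bolt(sentence_spout(["error retry", "error ok"])))
# topology == Counter({"error": 2, "retry": 1, "ok": 1})
```

In real Storm each spout and bolt runs as many parallel tasks across the cluster, and the framework handles tuple routing, acknowledgment and replay between them.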
Elasticsearch: Distributed
Document Search
• Distributed search engine built on Apache Lucene
• A schema-less document store using JSON as its
document format. New fields can be added dynamically.
All fields are indexed by default
• Uses index shards to distribute queries and searches
across clusters. Queries and searches run in parallel
• A cluster can host multiple indexes that can be queried as
a group or singly. Index aliases allow indexes to be
added or dropped dynamically
• Append-only model using versioning. Writes are very fast
depending on the wait model (wait for all shards to be
written, a quorum, or none)
• Well-defined RESTful API interface. Very powerful query
features
• Other tools: Apache Solr
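A sketch of the JSON bodies involved in indexing and searching log events. The `logs` index, `event` type and field names are hypothetical; in practice these bodies would be sent over HTTP to the cluster's RESTful endpoints:

```python
import json

# A log event to index (would be POSTed to /logs/event/ on the cluster;
# no schema is declared up front -- fields are mapped and indexed on arrival).
document = {
    "timestamp": "2014-04-01T10:02:11Z",
    "level": "ERROR",
    "message": "connection timed out",
}

# A full-text match query against the message field
# (would be POSTed to /logs/event/_search).
search_body = {
    "query": {"match": {"message": "timed out"}},
    "size": 10,
}

payload = json.dumps(search_body)
```

Because every request and response is plain JSON over HTTP, any client, including Kibana running in the browser, can talk to the cluster directly.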
Elasticsearch: Distributed
Document Search
Elasticsearch: Distributed Query and searches using index shards and replicas
A Really Cool UI to Show This Off
• Kibana – Works seamlessly with Elasticsearch, querying Elasticsearch
directly from JavaScript
• Everything is user driven, with very little coding beyond some configuration
settings in YAML
• Very dynamic screen interface
• Screen layout, queries, filters, graphs, histograms are saved directly to
Elasticsearch
• Great design and user interface
Putting It In Action: Demo
As a reminder, please submit your
questions in the chat box
We will get to as many as possible!
4/1/2014
Daily unique content
about content
management, user
experience, portals
and other enterprise
information technology
solutions across a
variety of industries.
Perficient.com/SocialMedia
Facebook.com/Perficient
Twitter.com/Perficient
Thank you for your participation today.
Please fill out the survey at the close of this session.
