Zohar Elkayam
CTO, Brillix
Zohar@Brillix.co.il
Big Data For CIOs
Who am I?
• Zohar Elkayam, CTO at Brillix
• DBA, team leader, and a senior consultant for over 17 years
• Oracle ACE Associate
• Involved with Big Data projects since 2011
• Blogger – www.realdbamagic.com
http://brillix.co.il
About Brillix
• Brillix is a leading company that specializes in data management
• We provide professional services and consulting for databases, security, and Big Data solutions
Agenda: Big Data
• Big Data
• Why
• What
• Where
• Who and How
• A Big Data Solution: Hadoop
• NoSQL vs. RDBMS
What is Big Data?
"Big Data"??
Different definitions
“Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” - Teradata Magazine article, 2011
“Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” - The McKinsey Global Institute, 2012
“Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.” - Wikipedia, 2014
Success Stories
More success stories
MORE stories..
• Crime Prevention in Los Angeles
• Diagnosis and treatment of genetic diseases
• Investments in the financial sector
• Generation of personalized advertising
• Astronomical discoveries
Examples of Big Data Use Cases Today
• MEDIA / ENTERTAINMENT: Viewers / advertising effectiveness
• COMMUNICATIONS: Location-based advertising
• EDUCATION & RESEARCH: Experiment sensor analysis
• CONSUMER PACKAGED GOODS: Sentiment analysis of what’s hot, problems
• HEALTH CARE: Patient sensors, monitoring, EHRs; Quality of care
• LIFE SCIENCES: Clinical trials; Genomics
• HIGH TECHNOLOGY / INDUSTRIAL MFG.: Mfg quality; Warranty analysis
• OIL & GAS: Drilling exploration sensor analysis
• FINANCIAL SERVICES: Risk & portfolio analysis; New products
• AUTOMOTIVE: Auto sensors reporting location, problems
• RETAIL: Consumer sentiment; Optimized marketing
• LAW ENFORCEMENT & DEFENSE: Threat analysis - social media monitoring, photo analysis
• TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; Customer sentiment
• UTILITIES: Smart Meter analysis for network capacity
• ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; Web-site optimization
Most Requested Uses of Big Data
• Log Analytics & Storage
• Smart Grid / Smarter Utilities
• RFID Tracking & Analytics
• Fraud / Risk Management & Modeling
• 360° View of the Customer
• Warehouse Extension
• Email / Call Center Transcript Analysis
• Call Detail Record Analysis
The Challenge
The Big Data Challenge
Volume
• Big data comes in one size: big.
• Size is measured in terabytes (10^12 bytes), petabytes (10^15), exabytes (10^18), and zettabytes (10^21)
• Storing and handling the data becomes an issue
• Producing value out of the data in a reasonable time is an issue
Some numbers
• How much data in the world?
• 800 Terabytes, 2000
• 160 Exabytes, 2006 (1 EB = 10^18 B)
• 4.5 Zettabytes, 2012 (1 ZB = 10^21 B)
• 44 Zettabytes by 2020
• How much is a zettabyte?
• 1,000,000,000,000,000,000,000 bytes
• A stack of 1TB hard disks that is 25,400 km high
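The stack-height figure can be checked with a few lines of arithmetic. The sketch below assumes each 1 TB disk is about 25.4 mm (one inch) thick; that thickness is an assumption chosen for illustration, not a figure stated on the slide.

```python
# Check the "stack of 1 TB disks" figure for one zettabyte.
ZETTABYTE = 10**21          # bytes
TERABYTE = 10**12           # bytes
DISK_THICKNESS_MM = 25.4    # assumption: roughly one inch per drive

# Number of 1 TB disks needed to hold one zettabyte.
disks_needed = ZETTABYTE // TERABYTE

# Convert the stacked thickness from millimetres to kilometres.
stack_height_km = disks_needed * DISK_THICKNESS_MM / 1_000_000

print(f"{disks_needed:,} disks, stack about {stack_height_km:,.0f} km high")
```

Under that assumption the result agrees with the 25,400 km quoted above.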
Growth Rate
• How much data is generated in a day?
• 7 TB, Twitter
• 10 TB, Facebook
Data grows fast!
Variety
• Big Data extends beyond structured data to include semi-structured and unstructured information: logs, text, audio, and video.
• A wide variety of rapidly evolving data types requires highly flexible stores and handling.
Structured & Un-Structured
Un-Structured | Structured
Objects | Tables
Flexible | Columns and Rows
Structure Unknown | Predefined Structure
Textual and Binary | Mostly Textual
Big Data is ANY data
• Some has fixed structure
• Some is “bring your own structure”
• We want to find value in all of it
Unstructured, Semi-Structured and Structured
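To make the three kinds of data concrete, here is a minimal Python sketch. All sample records, field names, and the log format are hypothetical, and the regular expression is just one way to "bring your own structure" to free text at read time.

```python
import json
import re

# Structured: the shape (date, user, amount) is fixed before the data arrives.
structured_row = ("2015-03-01", "alice", 42.5)

# Semi-structured: self-describing, and fields may vary from record to record.
semi_structured = json.loads('{"user": "bob", "tags": ["vip"], "amount": 10}')

# Unstructured: free text; structure is imposed only when we read it.
log_line = "2015-03-01 12:00:01 ERROR payment failed for user=carol amount=99.90"
pattern = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) (?P<level>\w+) "
    r".*user=(?P<user>\w+) amount=(?P<amount>[\d.]+)"
)
match = pattern.match(log_line)
parsed = match.groupdict() if match else {}

print(parsed["user"], float(parsed["amount"]))
```

The point of the sketch is that the same analysis question (who paid how much) needs progressively more work as the data moves from structured to unstructured.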
Data Types by Industry
Velocity
• The speed at which the data is generated and collected
• Streaming data and large-volume data movement
• High velocity of data capture requires rapid ingestion
• Ingestion that falls behind capture causes a backlog problem
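The backlog problem is simple arithmetic: whenever data arrives faster than it is ingested, the queue grows linearly. The sketch below uses a deliberately simple linear model; the function name and the rates are hypothetical.

```python
def backlog_after(seconds, arrival_rate, ingest_rate, start=0):
    """Events still waiting after `seconds`, given events/sec arriving and ingested.

    A linear model: the backlog changes by (arrival_rate - ingest_rate) every
    second and can never drop below zero.
    """
    growth = (arrival_rate - ingest_rate) * seconds
    return max(0, start + growth)

# Hypothetical rates: 12,000 events/s arrive, the pipeline ingests 10,000/s.
one_hour_backlog = backlog_after(3600, arrival_rate=12_000, ingest_rate=10_000)
print(f"{one_hour_backlog:,} events waiting after one hour")
```

Even a modest shortfall in ingestion capacity accumulates into millions of unprocessed events within an hour, which is why rapid ingestion is listed as a requirement.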
Global Internet Device Forecast
Internet of Things
Veracity
• Quality of the data can vary greatly
• Data sources might be messy or corrupted
So, What Defines Big Data?
• When we think that we can produce value from that data and
want to handle it
• When the data is too big or moves too fast to handle in a
sensible amount of time
• When the data doesn’t fit conventional database structure
• When the solution becomes part of the problem
Why Big Data Now?
• Because we have data:
• Data is born already in digital form
• Data grows by about 40% per year
• Because we can:
• $500 buys a drive that can store all the music of the world
• 40 years of Moore's Law = large computational resources
• 64% of organizations invested in big data in 2013
• $34 billion invested in big data in 2013
“Because we reached a dead end with logic”
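A growth rate of 40% per year compounds quickly. The sketch below works out what that rate implies; the helper names are hypothetical, and the only input taken from the slide is the 40% figure.

```python
import math

ANNUAL_GROWTH = 0.40  # the 40% per year figure quoted above

def doubling_time_years(rate):
    """Years needed for a quantity growing at `rate` per year to double."""
    return math.log(2) / math.log(1 + rate)

def projected(size, rate, years):
    """Size after `years` of compound growth at `rate` per year."""
    return size * (1 + rate) ** years

print(f"doubling time: {doubling_time_years(ANNUAL_GROWTH):.2f} years")
print(f"growth factor over 5 years: {projected(1.0, ANNUAL_GROWTH, 5):.2f}x")
```

At this rate the volume of data doubles roughly every two years, so storage and processing plans sized for today's volume are outgrown quickly.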
How to do Big Data
Big Data in Practice
• Big data is big: technological infrastructure solutions needed
• Big data is messy: data sources must be cleaned before use
• Big data is complicated: need developers and system admins
to manage intake of data
Big Data in Practice (cont.)
• Data must be broken out of silos in order to be mined, analyzed
and transformed into value
• The organization must learn how to communicate and interpret
the results of analysis
Infrastructure Challenges
• Infrastructure that is built for:
• Large-scale
• Distributed
• Data-intensive jobs that spread the problem across clusters of server
nodes
Infrastructure Challenges (cont.)
• Storage:
• Efficient and cost-effective enough to capture and store terabytes, if
not petabytes, of data
• With intelligent capabilities to reduce your data footprint such as:
• Data compression
• Automatic data tiering
• Data deduplication
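Two of the footprint-reduction techniques listed above, compression and deduplication, can be sketched with the Python standard library. This is a toy illustration, not how any particular storage product works: the function name and sample chunks are hypothetical, and automatic tiering is not shown.

```python
import hashlib
import zlib

def store_deduplicated(chunks):
    """Keep one compressed copy per distinct chunk, keyed by its SHA-256 digest."""
    store = {}
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:                    # deduplication
            store[digest] = zlib.compress(chunk)   # compression
    return store

# Hypothetical input: repetitive log chunks, two of them identical.
chunks = [
    b"GET /index.html 200\n" * 500,
    b"GET /index.html 200\n" * 500,
    b"POST /login 302\n" * 500,
]
store = store_deduplicated(chunks)
raw_bytes = sum(len(c) for c in chunks)
stored_bytes = sum(len(v) for v in store.values())
print(f"{len(chunks)} chunks -> {len(store)} stored, {raw_bytes} -> {stored_bytes} bytes")
```

Repetitive machine-generated data such as logs tends to benefit from both techniques, which is why they matter for cost-effective storage at this scale.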
Infrastructure Challenges (cont.)
• Network infrastructure that can quickly import large data sets and then replicate them to various nodes for processing
• Security capabilities that protect highly-distributed
infrastructure and data
Goals of Analytics
Positions in Big Data management
• DevOps engineers (system administrators and cluster managers) handle the infrastructure
• Data scientists are in charge of producing value from the data
Data Scientist
Hadoop
Apache Hadoop
• Open source project run by Apache (2006)
• Hadoop brings the ability to cheaply process large amounts of
data, regardless of its structure
• It has been the driving force behind the growth of the big data industry
• Get the public release from:
• http://hadoop.apache.org/core/
Hadoop Creation History
Key points
• An open-source framework that uses a simple programming model to
enable distributed processing of large data sets on clusters of
computers.
• The complete technology stack includes
• common utilities
• a distributed file system
• analytics and data storage platforms
• an application layer that manages distributed processing, parallel
computation, workflow, and configuration management
• More cost-effective for handling large unstructured data sets than conventional approaches, and it offers massive scalability and speed
Why use Hadoop?
• Cost: Leverages commodity HW & open source SW
• Flexibility: Versatility with data, analytics & operation
• Scalability: Near linear performance up to 1000s of nodes
What Hadoop Is Not
• Hadoop does not replace DW or relational databases
• Hadoop is not for OLTP or real-time systems
• Very good for large amounts of data, not so much for smaller sets
• Designed for clusters – there is no Hadoop monster server (single server)
Hadoop Cluster in Yahoo
Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)
Hadoop under the Hood
Hadoop Main Components
• HDFS: Hadoop Distributed File System – distributed file
system that runs in a clustered environment.
• MapReduce – programming paradigm for running processes across a clustered environment.
HDFS is...
• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
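The "redundant storage" and "large files" points can be illustrated with a small calculation. The sketch below assumes the common HDFS defaults of 128 MB blocks (`dfs.blocksize`) and three replicas (`dfs.replication`); both are configurable, and the function name is hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # bytes; common default for dfs.blocksize, configurable
REPLICATION = 3              # common default for dfs.replication, configurable

def hdfs_footprint(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, raw bytes stored across the cluster).

    The last block only occupies as much space as it holds, so the raw
    footprint is simply the file size times the replication factor.
    """
    blocks = math.ceil(file_size / block_size)
    return blocks, file_size * replication

blocks, raw_bytes = hdfs_footprint(1024**3)   # a hypothetical 1 GiB file
print(f"{blocks} blocks, {raw_bytes // 1024**3} GiB of raw storage")
```

Because every block is tracked individually and copied to several nodes, a few large files are much cheaper to manage than many tiny ones, and the loss of a single commodity disk does not lose data.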
MapReduce is...
• A programming model for expressing distributed
computations at a massive scale
• An execution framework for organizing and performing such
computations
• An open-source implementation called Hadoop
MapReduce is good for...
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
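The classic illustration of this model is a word count. The sketch below runs the three phases (map, shuffle, reduce) in plain Python on one machine; it is not Hadoop code, and the function names are hypothetical. In a real cluster the map and reduce calls would run in parallel on different nodes, with the framework doing the shuffle.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

counts = word_count(["big data is big", "data is messy"])
print(counts)
```

Each map call needs only its own line and each reduce call needs only its own key, which is what makes the job embarrassingly parallel.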
MapReduce is OK for...
• Iterative jobs (e.g., graph algorithms)
• Each iteration must read/write data to disk
• IO and latency cost of an iteration is high
MapReduce is NOT good for...
• Jobs that need shared state/coordination
• Tasks are shared-nothing
• Shared-state requires scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records
Spark
• Fast and general MapReduce-like engine for large-scale data
processing
• Fast
• In-memory data storage for very fast interactive queries; up to 100 times faster than Hadoop MapReduce
• General
• Unified platform that can combine SQL, machine learning, streaming, graph, and complex analytics
• Ease of use
• Can be developed in Java, Scala or Python
• Integrated with Hadoop
• Can read from HDFS, HBase, Cassandra, and any Hadoop data source.
Key Concepts
Resilient Distributed Datasets
• Collections of objects spread
across a cluster, stored in RAM
or on Disk
• Built through parallel
transformations
• Automatically rebuilt on failure
Operations
• Transformations
(e.g. map, filter, groupBy)
• Actions
(e.g. count, collect, save)
Write programs in terms of transformations on
distributed datasets
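The ideas above (lazy transformations, actions that trigger work, and rebuilding from lineage) can be imitated in a few lines of plain Python. `MiniRDD` below is a hypothetical single-machine toy written for this illustration; it is not the Spark API and it does not distribute anything.

```python
class MiniRDD:
    """Toy stand-in for an RDD: records its lineage and computes lazily."""

    def __init__(self, source, lineage=()):
        self._source = source      # callable that returns the base data
        self._lineage = lineage    # tuple of ("map" | "filter", function) steps

    # Transformations: return a new MiniRDD, do no work yet.
    def map(self, fn):
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replaying the lineage from the source is also how lost data
        # could be rebuilt after a failure.
        data = iter(self._source())
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

numbers = MiniRDD(lambda: range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect(), even_squares.count())
```

Nothing is computed when `map` and `filter` are called; the work happens only when an action such as `collect` or `count` runs, which is the programming style the slide describes.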
Unified Platform
• Continued innovation bringing new functionality, e.g.:
• Java 8 (closures, lambda expressions)
• Spark SQL (SQL on Spark, not just Hive)
• BlinkDB (approximate queries)
• SparkR (R wrapper for Spark)
Big Data and NoSQL
The Challenge
• We want scalable, durable, high-volume, high-velocity, distributed data storage that can handle unstructured data and that fits our specific needs
• RDBMS is too generic and doesn’t cut it any more – it can do the job, but it is not cost-effective for our use cases
The Solution: NoSQL
• Let’s take some parts out of the standard RDBMS and design the solution for our specific uses
• NoSQL databases have been around for ages under different
names/solutions
Example Comparison: RDBMS vs. Hadoop
 | Typical Traditional RDBMS | Hadoop
Data Size | Gigabytes | Petabytes
Access | Interactive and Batch | Batch – NOT Interactive
Updates | Read / Write many times | Write once, Read many times
Structure | Static Schema | Dynamic Schema
Scaling | Nonlinear | Linear
Query Response Time | Can be near immediate | Has latency (due to batch processing)
Hadoop and Relational Database
Hadoop
Best Used For:
• Structured or Not (Flexibility)
• Scalability of Storage/Compute
• Complex Data Processing
• Cheaper compared to RDBMS
Relational Database
Best Used For:
• Interactive OLAP Analytics (<1sec)
• Multistep Transactions
• 100% SQL Compliance
Best when used together
The NOSQL Movement
• NOSQL is not a technology – it’s a concept
• We need high performance, scale out abilities or agile structure
• We are willing to sacrifice our sacred database cows:
consistency, transactions, durability
• Over 150 different brands and solutions
(http://nosql-database.org/).
Is NoSQL an RDBMS Replacement?
NO
Well... sometimes it is…
NoSQL Taxonomy
Type Examples
Key-Value Store
Document Store
Column Store
Graph Store
Key Value Store
• Distributed hash tables
• Very fast to get a single value
• Examples:
• Amazon DynamoDB
• Berkeley DB
• Redis
• Riak
• Cassandra
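A distributed hash table places each key on a node chosen by hashing the key, which is why fetching a single value is fast: the client can compute where to look. The class below is a hypothetical in-memory toy; it uses simple modulo placement, whereas real systems typically use schemes such as consistent hashing so that adding a node does not move most keys.

```python
import hashlib

class TinyKeyValueStore:
    """Toy distributed hash table: each key lives on exactly one node."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}   # node name -> local dict
        self._names = sorted(node_names)

    def _node_for(self, key):
        # Hash the key and map the digest onto one of the nodes.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self._names[int(digest, 16) % len(self._names)]

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self._node_for(key)].get(key, default)

store = TinyKeyValueStore(["node-a", "node-b", "node-c"])
store.put("user:42", {"name": "Dana"})
print(store.get("user:42"))
```

Lookups by key touch one node only; anything else, such as searching by value, would have to visit every node, which is the trade-off this family of stores makes.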
Document Store
• Similar to Key/Value, but value is a document
• JSON or something similar, flexible schema
• Agile technology
• Examples:
• MongoDB
• CouchDB
• CouchBase
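The flexible schema is easiest to see with two documents that share a collection but not a set of fields. The sketch below uses plain Python dictionaries and JSON; the collection, field names, and `find` helper are hypothetical and do not correspond to any product's API.

```python
import json

# Two documents in the same hypothetical "customers" collection; fields differ.
customers = {
    "c1": {"name": "Noa", "email": "noa@example.com"},
    "c2": {"name": "Avi", "phones": ["050-0000000"], "address": {"city": "Tel Aviv"}},
}

def find(collection, **criteria):
    """Return ids of documents whose top-level fields match all criteria."""
    return [
        doc_id
        for doc_id, doc in collection.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

print(find(customers, name="Avi"))
print(json.dumps(customers["c2"], sort_keys=True))
```

No schema change was needed to give the second customer phones and a nested address, which is what makes the model attractive for fast-changing applications.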
What is a Column Store Database?
• Column Store databases are database management systems that store data in a columnar format, which speeds up analysis of single-column data (e.g., aggregation). Data is saved and handled as columns instead of rows.
• Examples:
• HP Vertica
• Pivotal (EMC) Greenplum
• Hadoop HBase
• Amazon’s SimpleDB
• Cassandra
Query Data
• When we query data, records are read in the order they are organized in the physical structure
• Even when we query a single column, we still need to read the entire table and extract the column
[Figure: a table of Row 1 to Row 4 by Col 1 to Col 4, queried with "Select Col2 From MyTable" and "Select * From MyTable"]
How Do Column Stores Keep Data
[Figure: organization in row store vs. organization in column store, queried with "Select Col2 From MyTable"]
Row Format vs. Column Format
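The difference between the two layouts can be sketched by holding the same small table both ways and counting how many values a single-column aggregate has to touch. The table contents and function names below are hypothetical, and in-memory Python lists only approximate what happens on disk.

```python
# The same hypothetical 4x4 table in two physical layouts.
rows = [
    {"Col1": 1, "Col2": 10, "Col3": "a", "Col4": True},
    {"Col1": 2, "Col2": 20, "Col3": "b", "Col4": False},
    {"Col1": 3, "Col2": 30, "Col3": "c", "Col4": True},
    {"Col1": 4, "Col2": 40, "Col3": "d", "Col4": False},
]
columns = {name: [row[name] for row in rows] for name in rows[0]}

def sum_col2_row_store(table):
    """Row layout: every full row is read, then Col2 is picked out of it."""
    total, values_touched = 0, 0
    for row in table:
        values_touched += len(row)
        total += row["Col2"]
    return total, values_touched

def sum_col2_column_store(table):
    """Column layout: only the Col2 values are read."""
    col = table["Col2"]
    return sum(col), len(col)

print(sum_col2_row_store(rows))
print(sum_col2_column_store(columns))
```

Both layouts give the same answer, but the column layout reads a quarter of the values here, and the gap widens as tables get wider.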
Graph Store
• Inspired by graph theory
• Data model: nodes, relationships, and properties on both
• Relational databases have a hard time representing a graph
• Examples:
• Neo4j
• InfiniteGraph
• RDF
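The data model of nodes, relationships, and properties can be sketched in plain Python, together with the kind of traversal query that graph stores are built for. The people, the `KNOWS` relationship type, and the helper names are hypothetical and are not the API of any graph database.

```python
from collections import deque

# Nodes and relationships both carry properties.
nodes = {"alice": {"age": 34}, "bob": {"age": 29}, "carol": {"age": 41}, "dave": {"age": 37}}
relationships = [
    ("alice", "KNOWS", "bob", {"since": 2010}),
    ("bob", "KNOWS", "carol", {"since": 2012}),
    ("carol", "KNOWS", "dave", {"since": 2014}),
]

def neighbors(node, rel_type):
    """Nodes directly reachable from `node` over relationships of `rel_type`."""
    return [dst for src, kind, dst, _props in relationships if src == node and kind == rel_type]

def within_hops(start, rel_type, max_hops):
    """Breadth-first traversal: nodes reachable in at most `max_hops` steps."""
    seen, queue, found = {start}, deque([(start, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for nxt in neighbors(node, rel_type):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, depth + 1))
    return found

print(within_hops("alice", "KNOWS", 2))
```

In a relational database each extra hop usually means another self-join, which is the difficulty the slide refers to; a graph store follows the relationships directly.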
Graph Example
Conclusion
• We do Big Data to gain Value. Without value, there is no Big Data
• Handling Big Data is a challenge – we talked about who uses it, when
and where
• Hadoop is a solution for Big Data use cases, but it is not a magic solution
• NoSQL, NewSQL and RDBMS are all solutions we can integrate for different use cases
• New organizational positions: cluster DevOps and data scientist.
Q&A
Thank You
Zohar Elkayam
twitter: @realmgic
Zohar@Brillix.co.il
www.realdbamagic.com

Big data for cio 2015
