
Trends in databases

ITP 249
Lecture 11
Outline
• Big data
– The Features of Big Data
– What Drives Big Data?
– Big Data Applications
– Real-time Analytics and Its Impact
• In-memory db
• Columnar db
• Limitations of SQL
• NoSQL db
What is big data?
• Something large and full of information?
– Maybe, but that says little about what Big Data really is
• Universal definition
– Extremely large data sets
– Grown beyond the capacity of traditional tools
– Also the processes of leveraging the data (e.g. analytics, BI, data mining)
• What kind of data?
– Every day we create 2.5 quintillion (2.5 × 10¹⁸) bytes of data
– 90% of the data in the world today was created in the last two years
– Sensors (IoT), blogs, pictures, videos, e-commerce, GPS, etc.
• Analytics and research define Big Data today
– More data, more analysis, more results
– Presents opportunities for deep analysis, pattern prediction, and correlation
Structured vs. Unstructured Data
• Structured
– Strictly organized, common schema
– Designed for management by computers
– Relational databases & spreadsheets
– Standard search operations
• Unstructured
– No uniform structure
– Designed for use by humans & devices
– Word docs, PDFs, emails, videos, IoT sensor data, audio files, HTML, & images
– Limited data visibility
With the rise of 4K video, medical images, IoT, digital
information, AI and analytics, the data explosion is
accelerating.

[Chart, © IBM Corporation 2017: projected storage capacity in exabytes, 2010–2020, split into unstructured, file and object, and structured block storage. Callouts: 80% of all data was created in the last 2 years; 331 EB of object-based (unstructured) storage capacity by 2021; the number of enterprises with 1 PB+ of unstructured data grows 3X from 2016.]
The growing imperative of Business Data
• Analytics have emerged over the years from transactional, structured data (sales transactions, databases)…
• …toward massive interactive, unstructured content (documents, web pages, cameras, text messages, emails)
• 80% of that content is unstructured
Who is using Big Data?
– Science / Research (NASA / NOAA)
– Pharma / Health
– Energy
– Media and Entertainment
– Manufacturing
– Finance
– All Businesses today leverage some form of big data
– References:
• http://www.cnbc.com/id/100792215
• http://video.cnbc.com/gallery/?video=3000168940
• http://www.cnbc.com/id/100638376
The Features of Big Data
• 7 ‘V’s that describe the features of big data
Volume
• Volume of data collected, stored, and shared is growing
faster than ever before
• Not all data are stored. Some are discarded, others are
archived. Even then the total volume is growing
Variety
• Source of data
• Form of data
• Business data, social media data
• Multiple languages
• Formats – text, voice, photos, video, audio
Velocity
• Speed at which data are generated and collected
• Can also refer to how quickly data can be
processed
Variability
• Changes in the meaning of data over time or in
context (asset class over time)
• Data of unknown or indistinct type or structure
or format (number, text, emoji, etc)
• Sentiment analysis uses natural language
processing to derive the attitude of the writer
Veracity
• Reliability or truthfulness of data
• Errors and inaccuracies
• Separating noise from signal
Volatility
• Lifespan of data
• How long data are available
• How long should it be stored
Value
• Driving force of big data analytics is value
• Should provide benefit to someone
• Providing big data itself is a business
• Evaluate the benefit of investing in big data
against the cost
What continues to drive Big Data?
• World is becoming more digital
• World is becoming more connected
• Electronic/digital devices are becoming more
economical (putting technology in the hands of
more people)
• Traditional forms of social communications are
being replaced with digital ones that are often
‘free’
Not just the Data => Big Data Applications
• Business Intelligence -> AI
• AI -> Machine Learning -> Deep Learning
• Application categories:
– Statistical applications (trends)
– Predictive analysis (trends -> predictions)
– Data modeling / data visualization
– What-if scenarios
In-memory Databases (IMDB)
• Using RAM instead of hard disk for the database
• All relevant data are in memory all the time
• Speeds up queries to provide real-time or near-real-time analytics capabilities
• Innovations
– Data are stored in RAM
– Use of columnar storage for the relational database
– Indexing (comes essentially free with columnar storage)
– Data compression
– Parallel data processing
– Partitioning data
• SAP HANA is an IMDB
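To make the idea concrete, here is a minimal sketch using SQLite's in-memory mode. The table and values are invented for illustration; a production IMDB such as SAP HANA layers columnar storage, compression, and parallel execution on top of the basic keep-everything-in-RAM idea.

```python
import sqlite3

# An in-memory database: all pages live in RAM, nothing touches disk.
# (Illustrative only -- real IMDBs add columnar storage, compression,
# and parallel query execution on top of this idea.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, product TEXT, pieces INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("USA", "DXTR1100", 5), ("USA", "DXTR1100", 21),
     ("Germany", "DXTR3100", 12), ("Germany", "DXTR3100", 34)],
)

# Queries run against RAM, so analytic aggregations return with very low latency.
for row in conn.execute(
        "SELECT country, SUM(pieces) FROM sales GROUP BY country ORDER BY country"):
    print(row)   # ('Germany', 46) then ('USA', 26)
```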
Real-time Analytics and Its Impact
• Provides almost instantaneous feedback from analytics processing
• React to changing customer needs
• React to opportunities in real time
• Example: customer service at a credit card company
– Customer navigates the website but cannot find a resolution, then calls a customer service rep
– Real-time analytics helps improve the customer experience:
– Re-order the call tree to match the customer's most likely reason for calling
– Prepopulate the rep's screens
– Eliminate options from the phone tree based on the customer's browsing history
– Change the language of the chat or call
– Make promotional offers to the customer
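As a toy illustration of the "re-order the call tree" idea only: the snippet below ranks phone-tree options by how often the caller just visited related self-service pages. Every name and mapping here is hypothetical.

```python
# Hypothetical sketch: rank call-tree options by the caller's recent
# browsing history so the most likely reason for calling is offered first.
recent_pages = ["dispute-a-charge", "dispute-a-charge", "travel-notice"]

# Map call-tree options to the self-service pages that usually precede the call.
option_pages = {
    "Dispute a transaction": {"dispute-a-charge"},
    "Report card lost or stolen": {"lost-card"},
    "Set a travel notice": {"travel-notice"},
    "Ask about rewards": {"rewards"},
}

def score(option):
    pages = option_pages[option]
    return sum(1 for page in recent_pages if page in pages)

# Options the customer was just reading about float to the top of the menu.
reordered = sorted(option_pages, key=score, reverse=True)
print(reordered[0])   # "Dispute a transaction"
```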
In-Memory Appliance Development
• Drivers
– Big data
– Predictive analytics
– Real-time analytics
– Self-service BI
• Enabling hardware innovations
– High-capacity RAM
– Multi-core processor architectures
– Massive parallel scaling
– Massively parallel processing (MPP)
– Large symmetric multiprocessors (SMP)

Image Source: Ralokota, R. (May 15, 2011). New tools for new times – primer on big data, Hadoop and "in-memory" data clouds. Retrieved from http://practicalanalytics.wordpress.com/2011/05/15/new-tools-for-new-times-a-primer-on-big-data/
Performance Bottleneck Comparison
• Without high-capacity RAM
– Database stored on disk
– Bottleneck: latency between disk and RAM
• With high-capacity RAM
– Database stored in memory
– Bottleneck: latency between CPU and RAM
– Orders-of-magnitude response time improvements

Image Source: Morrison, A. (2012). The art and science of new analytics technology. PwC Technology Forecast, 1, 31–43. Retrieved from http://www.pwc.com/en_US/us/technology-forecast/2012/issue1/features/feature-art-science-analytics-technology.jhtml
Software That Leverages Hardware Innovations

Source: Plattner, H. & Zeier, A. (2011). In Memory Data Management: An Inflection Point for Enterprise Applications. Retrieved from http://www3.weforum.org/docs/GITR/2012/GITR_Chapter1.7_2012.pdf
Another Innovation - Columnar Databases
• Advantages
– Better I/O bandwidth utilization
– Higher cache efficiency
– Faster data aggregation
– High compression rates
– Column-based parallel processing
• Disadvantages
– Load times can be slow
– Less efficient for transactional processes
– Possibly slower relational interfaces
Columnar Storage Example

Source table:
Country   Customer  Product Sold  Pieces
USA       3000      DXTR1100      5
USA       4000      DXTR1100      21
Germany   23000     DXTR3100      12
Germany   17000     DXTR3100      34

Row table (each row stored contiguously):
Row 1: USA, 3000, DXTR1100, 5
Row 2: USA, 4000, DXTR1100, 21
Row 3: Germany, 23000, DXTR3100, 12
Row 4: Germany, 17000, DXTR3100, 34

Column table (each column stored contiguously):
Column 1: USA, USA, Germany, Germany
Column 2: 3000, 4000, 23000, 17000
Column 3: DXTR1100, DXTR1100, DXTR3100, DXTR3100
Column 4: 5, 21, 12, 34
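A rough Python model of the two layouts above, only to show why column stores aggregate a single column quickly; it is not how a real engine stores bytes.

```python
# Row store: each record kept together -- good for fetching a whole row.
rows = [
    ("USA",     3000,  "DXTR1100", 5),
    ("USA",     4000,  "DXTR1100", 21),
    ("Germany", 23000, "DXTR3100", 12),
    ("Germany", 17000, "DXTR3100", 34),
]

# Column store: each attribute kept together -- good for scans/aggregates
# over one column, and repeated values (country, product) compress well.
columns = {
    "country":  ["USA", "USA", "Germany", "Germany"],
    "customer": [3000, 4000, 23000, 17000],
    "product":  ["DXTR1100", "DXTR1100", "DXTR3100", "DXTR3100"],
    "pieces":   [5, 21, 12, 34],
}

# Aggregating pieces only touches one contiguous column, not every row.
print(sum(columns["pieces"]))   # 72
# Fetching one complete record is the row store's strength.
print(rows[2])                  # ('Germany', 23000, 'DXTR3100', 12)
```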


Super Simple App & Schema
Monolithic ERP Application with super simple
schema:
• Employee
• Salary
• Department
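A minimal sketch of that schema as SQL DDL, run through SQLite purely for concreteness; the column names are invented for illustration.

```python
import sqlite3

# Hypothetical columns for the 'super simple' ERP schema on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE employee (
    emp_id    INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    dept_id   INTEGER REFERENCES department(dept_id)
);
CREATE TABLE salary (
    emp_id    INTEGER REFERENCES employee(emp_id),
    amount    NUMERIC NOT NULL,
    effective DATE
);
""")
```

Every record must fit this fixed layout; adding an attribute later means altering the table for all existing rows, which is exactly the rigidity the next slide pushes against.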
Modern Apps (Mobile/Social)
• A new app comes along that needs to be 'internet scale'. What if in your schema…
– You need to add or remove fields, lots of them, frequently?
– You need another table with a 'variable' schema?
• What if for your infrastructure…
– You need to scale out, not up
– Writes are as numerous as reads
– Data volume is high and the growth rate is high
– Use is decentralized (web, mobile, IoT)
Limitations of SQL (RDBMS)
• Rigid schema, not easy to add columns
(attributes) as needed
• JOINs are expensive!
• Transaction handling is complex with millions of
concurrent users
• Maintenance, schema changes, and scaling typically require some downtime
• Unstructured data is not easily handled
• Not adaptive to new requirements
NoSQL
• "Not Only SQL"
• Not based on the relational model
• May still support SQL-like querying
• Often based on key-value pairs
• Schema-less
• ACID transactions may be compromised to increase performance, availability, and speed; typically eventually consistent
SQL vs. NoSQL
Enter NoSQL Data Stores
• Key-Value: Amazon DynamoDB
• Column: Cassandra
• Graph DB: Neo4j
• Document: MongoDB
Key-Value Stores
• Use case:
– Quick lookups with no 'relational' component (no joins)
– Fast and highly scalable
– Often (mostly) in memory
• Example: Amazon DynamoDB
• Application:
– User session data shared between applications
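A conceptual sketch of the session-data use case, with a plain Python dict standing in for a key-value store such as Amazon DynamoDB (the real store adds replication, expiry, and persistence behind a similar get/put interface); the keys and fields are made up.

```python
import json, uuid

# Conceptual model of a key-value store: an opaque value looked up by key,
# no joins, no schema. This is a toy stand-in, not a real client API.
kv_store = {}

def put(key, value):
    kv_store[key] = json.dumps(value)        # values are just blobs to the store

def get(key):
    raw = kv_store.get(key)
    return json.loads(raw) if raw is not None else None

# Shared session data: any app server can fetch the session by its ID.
session_id = str(uuid.uuid4())
put(f"session:{session_id}", {"user": "alice", "cart": ["DXTR1100"], "lang": "en"})
print(get(f"session:{session_id}")["cart"])   # ['DXTR1100']
```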
Column Stores
• Use case:
– Super scalable
– MapReduce support
• Example: Cassandra
• Application:
– Large-scale real-time data logging (finance, web analytics)
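A rough model of the wide-row idea behind column-family stores such as Cassandra, applied to real-time event logging. This is a toy in-process model, not Cassandra's API, and the row keys and fields are invented.

```python
from collections import defaultdict
import time

# Conceptual wide-row model: one row key (e.g. a page or ticker symbol)
# holds many timestamp-named columns, appended as events stream in.
column_family = defaultdict(dict)   # row_key -> {column_name: value}

def log_event(row_key, value):
    column_name = str(time.time_ns())        # timestamp-named column
    column_family[row_key][column_name] = value

log_event("page:/checkout", {"user": "alice", "status": 200})
log_event("page:/checkout", {"user": "bob", "status": 500})

# Reading one row key returns that entity's whole event history.
print(len(column_family["page:/checkout"]))   # 2
```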
Graph DB
• Use case:
– Dense network of strongly connected entities
– Nodes and relationships
– Graph data modeling
• Example: Neo4j
• Application:
– Facebook graph search, Google knowledge graph, Twitter
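A tiny adjacency-list sketch of the nodes-and-relationships model. This is plain Python, not Neo4j's API (in Neo4j the same traversal would be a Cypher query), and the people and edges are made up.

```python
# Nodes and relationships as a plain adjacency list -- a toy model of the
# property-graph idea, not a graph database engine.
follows = {
    "alice": {"bob", "carol"},
    "bob":   {"carol"},
    "carol": {"dave"},
    "dave":  set(),
}

def friends_of_friends(person):
    direct = follows[person]
    # Traverse one more hop; graph databases make this kind of
    # relationship traversal cheap even across millions of nodes.
    return set().union(*(follows[f] for f in direct)) - direct - {person}

print(friends_of_friends("alice"))   # {'dave'}
```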
Document Store
• Use case:
– Semi-structured data with SQL-like queries
– Collections of related key-value pairs with variable schemas
• Example: MongoDB
• Application:
– Document-driven web or other applications
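A sketch of a document collection as plain Python dicts with variable schemas, plus a small filter standing in for a MongoDB-style query; it is not MongoDB's actual API, and the documents are invented.

```python
# Documents: related key-value pairs grouped per record; two documents in
# the same collection need not share a schema. Toy model, not a real driver.
products = [
    {"_id": 1, "name": "DXTR1100", "price": 199, "tags": ["desk", "oak"]},
    {"_id": 2, "name": "DXTR3100", "price": 349,
     "dimensions": {"w": 120, "d": 60}},   # extra nested field, no ALTER TABLE
]

def find(collection, **criteria):
    # SQL-like filtering over semi-structured records.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(products, name="DXTR3100")[0]["price"])   # 349
```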
Distributed Computing
• Apache Hadoop
– Distributed computing
– Parallel processing
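The classic word count sketches the split-map-reduce pattern that Hadoop parallelizes across a cluster; the version below runs in a single process purely to show the programming model, with made-up input splits.

```python
from collections import Counter
from functools import reduce

# Word count, the canonical MapReduce example. Hadoop runs the map and
# reduce phases in parallel across many machines; this single-process
# version only sketches the programming model.
chunks = ["big data big results", "more data more analysis"]   # input splits

def map_phase(chunk):
    return Counter(chunk.split())   # local counts per split

def reduce_phase(a, b):
    return a + b                    # merge partial counts

totals = reduce(reduce_phase, map(map_phase, chunks), Counter())
print(totals["data"])   # 2
```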
When to…

Use an RDBMS when you need/have…
• Centralized applications (e.g. ERP)
• Moderate to high availability
• Moderate-velocity data
• Data coming in from one or a few locations
• Primarily structured data
• Complex/nested transactions
• Primary concern is scaling reads
• Philosophy of scaling up for more users/data
• Moderate data volumes, maintained with purging

Use NoSQL when you need/have…
• Decentralized applications (e.g. web, mobile, and IoT)
• Continuous availability; no downtime
• High-velocity data (devices, sensors, etc.)
• Data coming in from many locations
• Structured, with semi-/unstructured data
• Simple transactions
• Concern is to scale both writes and reads
• Philosophy of scaling out for more users/data
• High data volumes, retained forever
What if you have both? (and they are Big)
• SQL-like distributed query engines:
– Hive
– Presto
– Drill
– Impala
– Spark SQL
– Lingual
• Distributed computing platforms:
– Hadoop
– Spark
– Tez
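For example, Spark SQL lets you run ordinary SQL over distributed data. The sketch below assumes a local PySpark installation and an invented events.json file with a page column; it is illustrative, not a tuned cluster job.

```python
# Hedged sketch: assumes pyspark is installed and 'events.json' exists;
# the path and column names are invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-big-data").getOrCreate()

events = spark.read.json("events.json")        # semi-structured input
events.createOrReplaceTempView("events")

# Familiar SQL, executed as a distributed job across the cluster.
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM events
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```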
When in doubt, ask…
• What are the application's use cases?
• What is the application's data model?
• What is the need for scalability on reads/writes?
• What is the query pattern for the application or its users?
