Overview of Big Data Analytics (MCSD 1053)
Data Science Governance Framework
OVERVIEW 1 – BIG DATA ANALYTICS FRAMEWORK
Contents of the Overview
• What is Big Data? Why Should We Care?
• Who Uses Big Data & How?
• Big Data Skills & Technologies
Part 1
What is Big Data?
Why should We Care?
Copyright © 2015 Andy Koronios and Jing Gao, All Rights Reserved.
Some things are SO BIG
that they have implications for
EVERYONE!
Big Data is one of those things
Source: Peppard 2011
Everything, Everywhere,
Intelligent, Instrumented &
Interconnected world!
The Internet of Things
More Data, More Often, From More Sources…..
Big Data Sources
COMPARATIVE VOLUMES
• ENTERPRISE DATA WAREHOUSE – 1 TB
• KLSE STOCK EXCHANGE – 1 TB/DAY
• MALAYSIA AIRLINES – 1 TB/MIN
Square Kilometre Array (SKA) radio telescope could generate more data per day than the entire Internet
Data Volumes Are Exploding!
Torrents of data:
• 40% increase per year
• 90% generated in the last 2 years
• Only 5% is structured
[Chart: Social Media & World Population, 2009–2014]
Source: McCrindle Social Media Highlights, 2014
Everything we
do is leaving a
digital trace
Laptops and Smartphones Lead Data Traffic Growth
• 92 percent: Compound annual growth rate in data
traffic from 2010 to 2015.
• 5.6 billion: Number of personal devices connected to
mobile networks by 2015.
• 1.5 billion: Number of machine-to-machine nodes.
• 66 percent: Portion of data traffic allocated to video by
2015.
• 159 percent: Increase in global mobile data traffic
from 2009 to 2010.
• 129 percent: Compound annual growth rate of
mobile data traffic growth projected in Middle East
and Africa over 2010 to 2015.
• 248 petabytes: Amount of monthly data expected
from tablets in 2015. That's more than the entire
global mobile network in 2010.
• 295 petabytes: Amount of mobile data traffic
expected to come from machine-to-machine
connections in 2015.
• 613 kbps: Average smartphone connection speed in
2009.
• 4,404 kbps: Average smartphone connection speed in
2015.
The V’s of Big Data
‘Big’ Data @ Rest & In Motion
• Data @ Rest (vs. the DWH): millions of times more data
• Data in Motion: thousands of times faster
Data Velocity
■ The speed at which data is generated
■ The speed at which data is transferred
& analyzed
SELF-DRIVING CAR
• Data is analyzed in memory, while it is being generated
BIG DATA VARIETY
LIFE-CRITICAL DATA
4 Disruptive Technology Clusters in the
4th Industrial Revolution
Trends of New Business Models
New Value Pools
Digital Capacity Trends
• Information Flow
• Information Stock
• Information Computation
Data Treatment Trends
01 Cheap non-traditional data sources
02 No more random sampling
03 Real-time data
04 Merged data sources
05 Self-learning algorithms
Data Sources Trends
1. DIGITALLY GENERATED – the data are created digitally, can be stored using a series of ones and zeros, and can be manipulated by computers.
2. PASSIVELY PRODUCED – a by-product of our daily lives or of interaction with digital services.
3. AUTOMATICALLY COLLECTED – a system is in place that extracts and stores the relevant data as it is generated.
4. GEOGRAPHICALLY OR TEMPORALLY TRACKABLE – e.g. mobile phone location data or call duration time.
5. CONTINUOUSLY ANALYSED – information relevant to human well-being and development that can be analysed in real time.
Data Types Trends
1. DATA EXHAUST – passively collected transactional data from people’s use of digital services (mobile phones, purchases, web searches) and/or operational metrics and other real-time data collected by agencies to monitor their projects and programmes (stock levels, school attendance).
2. PHYSICAL SENSORS – satellite or infrared imagery of changing landscapes, traffic patterns, light emissions, urban development, topographic changes, etc.; this approach focuses on remote sensing of changes in human activity.
3. ONLINE INFORMATION – web content such as news media and social media interactions (e.g. blogs, Twitter), news articles and obituaries, e-commerce, job postings; this approach considers web usage and content as a sensor of human intent, sentiments, perceptions, and wants.
4. CITIZEN REPORTING OR CROWD-SOURCED DATA – information actively produced or submitted by citizens through mobile phone-based surveys, hotlines, user-generated maps, etc.; while not passively produced, this is a key information source for verification and feedback.
How well can we analyze
and use these ever-increasing
volumes of data?
So what about ‘big data’?
1. A problem
The Volume, Variety and Velocity of the data generated are stressing our IT
systems and our ability to handle the data
2. A Capability
That will allow us to squeeze more value from data
3. An Opportunity
To optimise processes, enhance decision-making & monetise data through
new business models
Part 2
Who Uses Big Data and How?
Big Data Analytics
• Input: structured or unstructured data
• Methods: statistical methods, machine learning, artificial intelligence…
• Valuable advanced insights: identify patterns, predict & forecast, optimization, decision making
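The "predict & forecast" outcome above can be illustrated with the simplest statistical method of all, a least-squares trend line. The sketch below is pure Python with invented monthly data volumes; it is an illustration of the idea, not part of the course materials.

```python
# Fit a linear trend y = a*x + b to monthly data volumes with ordinary
# least squares, then forecast a future month. Numbers are illustrative.

def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data volume (TB) observed over six months.
months = [1, 2, 3, 4, 5, 6]
volume = [10.0, 14.0, 18.0, 22.0, 26.0, 30.0]

a, b = fit_line(months, volume)
forecast_month_12 = a * 12 + b
print(round(forecast_month_12, 1))  # 54.0
```

In practice the "statistical methods" box covers far richer models, but every one of them shares this shape: fit parameters to observed data, then extrapolate.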
Example:
Talent scouting – a data-driven decision
Part 3
What are the Skill Sets and
Technologies underpinning Big Data?
Big Data Technology
• Relational databases struggle to store and process Big Data.
• As a result, a new class of big data technology has emerged
and is being used in many big data analytics environments.
What is Big Data Technology?
• A set of tools and mechanisms that enable a computer to process data that is
too big for conventional systems
High-level Declarative Languages for Writing
Queries and Data Analysis
• Pig, from Yahoo / Apache
• JAQL, from IBM
• Hive, from Facebook, etc.
NoSQL Databases & Data Management Tools
• Store and manage data not using Structured Query Language
(SQL), relational database schema, or other common relational
database internal operations
• A non-relational database management system, used where no fixed
schemas are required and data is scaled horizontally
Categories of NoSQL databases
• KEY-VALUE PAIR (e.g. Cassandra)
• Keys are used to retrieve values from opaque data blocks; essentially a hash map; tremendously fast
• DOCUMENT DATABASE (e.g. MongoDB and CouchDB)
• Again a key-value store, but the value is a document. Documents have no fixed schema and can be nested; queries are based on content as well as keys. Use cases: blogging websites
• COLUMNAR DATABASE (e.g. Microsoft Columnstore, SAP HANA)
• Works on attributes rather than tuples: the key is the column name and the value is the contiguous column values. Best for aggregation queries of the form SELECT (one or two columns’ values) WHERE (the same or another column’s value) = some value
• GRAPH DATABASES (e.g. Neo4j and Giraph)
• A collection of nodes and edges, where nodes represent data and edges represent the links between them. The most dynamic and flexible category
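The difference between the first two categories can be sketched in a few lines of plain Python. This is a toy model, not a real database: the keys, documents and field names below are invented, and real systems add persistence, distribution and indexing.

```python
# Key-value store: a hash map. Lookups only by key; values are opaque
# blocks the store does not interpret.
kv_store = {
    "user:1001": b"\x00\x01 opaque block",
    "user:1002": b"\x00\x02 opaque block",
}
value = kv_store["user:1001"]  # tremendously fast, but key-only access

# Document store: schema-less, possibly nested documents, queryable by
# content as well as by key (as in a blogging website).
doc_store = [
    {"_id": 1, "author": "amin", "tags": ["big data"],
     "comments": [{"user": "sara", "text": "nice post"}]},  # nested
    {"_id": 2, "author": "sara", "tags": ["nosql", "mongodb"]},
]

# Query by content: all posts tagged "nosql".
nosql_posts = [d for d in doc_store if "nosql" in d.get("tags", [])]
print([d["_id"] for d in nosql_posts])  # [2]
```

The same content-based query is impossible against the key-value store without fetching and decoding every value, which is exactly the trade-off the slide describes.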
Apache Hadoop
• Open-source software framework
• Distributed, scalable system for large data sets on
commodity hardware
• Written primarily in Java
• Architecture:
• File system – Hadoop Distributed File System (HDFS)
• Processing (programming model) – MapReduce
• Major users: Facebook, Yahoo, Amazon.com, Microsoft,
etc.
Apache Hadoop Ecosystem
Hadoop Distributed File System (HDFS)
[Diagram: a Client, the NameNode (NN), and several DataNodes (DN)]
• HDFS stores data in a distributed, scalable and fault-tolerant way
• The NameNode (NN) holds metadata about the data on the DataNodes (DN)
• The DNs actually hold the data, in the form of blocks, and are capable of communicating with one another
• Data is stored as compressed files spread across n commodity servers – not as relational tables and columns
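The block-and-metadata idea above can be sketched as follows. This is not HDFS code: the block size, node names and round-robin placement are invented for illustration (real HDFS uses 128 MB blocks and rack-aware placement).

```python
# A file is split into fixed-size blocks; each block is replicated on
# several DataNodes; the NameNode keeps only the metadata mapping
# blocks to nodes.
BLOCK_SIZE = 4          # bytes, tiny for illustration
REPLICATION = 3
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data: bytes, size: int = BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes=DATANODES, replication=REPLICATION):
    """NameNode-style metadata: block index -> list of DataNodes."""
    metadata = {}
    for i in range(len(blocks)):
        # naive round-robin placement across the nodes
        metadata[i] = [nodes[(i + r) % len(nodes)]
                       for r in range(replication)]
    return metadata

blocks = split_into_blocks(b"hello hdfs!")
meta = place_blocks(blocks)
print(len(blocks), meta[0])  # 3 ['dn1', 'dn2', 'dn3']
```

Fault tolerance falls out of the metadata: if one DN dies, every block it held still exists on the other nodes listed for it.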
MapReduce
[Diagram: the Client, JobTracker (JT), NameNode (NN), and DataNodes (DN), each DN paired with a TaskTracker (TT)]
• Mappers extract data from HDFS and put it into maps
• Reducers aggregate the results produced by the mappers
• The Job Tracker (JT) is the server component; it
• finds how many blocks the data occupies,
• contacts the NN, and
• sends the program to the data nodes
• The Task Tracker (TT) is the slave component; it completes the process on its DN
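The mapper/reducer division of labour can be shown with the classic word-count example. The sketch below runs the map, shuffle and reduce phases in a single Python process; in Hadoop the same phases would be distributed across TaskTrackers.

```python
# Word count in the MapReduce style: mappers emit (word, 1) pairs,
# a shuffle groups pairs by key, reducers aggregate each group.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return (key, sum(values))

lines = ["Big data big analytics", "big data"]
pairs = [p for line in lines for p in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts["big"], counts["data"])  # 3 2
```

Because each mapper sees only its own lines and each reducer only its own key, both phases parallelise naturally, which is the whole point of the model.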
The Apache Hadoop Family
Name Description
Hadoop Common Common utilities
HDFS Distributed file system
YARN Job scheduling and cluster resource management
MapReduce Parallel processing of large data sets
Chukwa A data collection system for large distributed systems
HBase Scalable distributed db supporting structured data storage
Hive A DWH infrastructure providing SQL-like, ad hoc querying; can cater for unstructured data
Mahout Scalable machine learning & data mining library (e.g. k-means for data clustering; random forest and logistic regression for data classification). Widely used to develop recommender systems for online businesses
The Apache Hadoop Family
(cont.)
Name Description
Pig High-level, procedural data-flow language to process data, speed up coding and make it handier. It can extract, transform and load data (ETL)
Zookeeper High performance co-ordination service for distributed applications
Flume Responsible for collecting, aggregating and moving data into HDFS.
Sqoop (SQL to Hadoop) To transfer data between Hadoop clusters and relational databases (such as
Oracle or Microsoft SQL Server) that traditionally use SQL instructions.
Kerberos Provides authentication services in Hadoop clusters
Serengeti Virtualization tool that helps build virtual Hadoop cluster in the cloud
Spark Processing engine that performs at speeds up to 100 times faster than MapReduce for iterative algorithms or interactive data mining. Provides in-memory cluster computing for speed. Supports Java, Scala, and Python APIs and combines SQL, streaming and complex analytics. Runs on Hadoop, Mesos, standalone, or in the cloud. Can access diverse data sources such as HDFS, Cassandra, HBase, or S3
BIG DATA ARCHITECTURE
[Architecture diagram: data flows left to right through six layers]
• Data Sources (structured & unstructured): internal operational systems; web data (social media, news, forums, open/public data); mixed media; machine data; spatial-temporal data (maps, land); external data streams (weather data, commercial BI data, stock data)
• Data Consolidation: select, extract, transform, integrate, load; incremental data; integrity checks; replication; archives
• Data Storage: staging database (shards 1…n); Data Lake; Data Warehouse (datamarts 1…n); BigchainDB (future)
• Data Provisioning: jConnect; data dictionaries & data model; business rules & dictionaries; security & access control; jClean; self-service dynamic extract; Smart Data Lake (future)
• Data Discovery: jView; descriptive, diagnostic, predictive and prescriptive analytics; visualization; OLAP analysis
• Applications & Users: analytics services, routine reports, dashboards, alerts, web apps, smart apps, mobile apps, automated systems – serving clients, business users, business analysts, data scientists and knowledge workers
JCORP DATA ADVISORY: MongoDB Data Lake Design
• Collections in the data lake: HR, FINANCE, NEWS & SOCIAL MEDIA
• Data sources:
• Internal server – SAP HANA DB: 1. HR data, 2. Procurement data; SQL Server DB: 1. Finance data, 2. Intrapreneur data
• Semi-private server – SQL Server DB: 1. Projects data (via stored procedure)
MANAGEMENT DASHBOARD: DATA SOURCES
Dashboard: Finance, Intrapreneur
Info System: Periodic Financial Reporting (FRP) & Account Consolidation
Database: SQL Server
Data Extraction: system report → Excel sheet → export to shared folder; shared folder → script → MongoDB
Data Frequency: monthly
Data Captured: 1. Closing Account (revenue, income statement, dimension, disclosure, corporate info); 2. Flows; 3. Investment; 4. Intercompany; 5. Partner; 6. Dimension
MANAGEMENT DASHBOARD: DATA SOURCES (cont.)
Dashboard: HR, Procurement
Info System: SAP
Database: SAP HANA
Data Extraction: SAP ad hoc query → Excel sheet → export to shared folder; shared folder → script → MongoDB
Data Frequency: monthly
Data Captured: 1. HR Info (new staff, active staff, resigned staff, payroll by department); 2. Material Management (purchasing)

Dashboard: Projects
Info System: UI template / Excel (internal access)
Database: SQL Server
Data Extraction: direct DB connection
Data Frequency: monthly
Data Captured: 1. Projects (project info, project timeline)
SOCIAL MEDIA DASHBOARD
Dashboards: JCORP, KPJ, QSR (KFC & Pizza Hut)
System: in-house development
Analysis methods: 1. Sentiment analysis – VADER Sentiment; 2. Social media engagement – Word2Vec; 3. Related news – machine learning; 4. Related word cloud
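As a rough illustration of lexicon-based sentiment analysis in the spirit of the VADER method named above, here is a minimal scorer in plain Python. It is far cruder than VADER: the tiny lexicon is invented, and there is no handling of intensifiers, negation or emoticons.

```python
# Each word carries a valence score; a post's sentiment is the sign of
# the summed score. Lexicon values here are made up for illustration.
LEXICON = {"great": 3.0, "good": 1.9, "love": 3.2,
           "bad": -2.5, "terrible": -3.1, "slow": -1.2}

def sentiment(text: str) -> str:
    score = sum(LEXICON.get(w.strip(".,!?").lower(), 0.0)
                for w in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love the new KFC menu, great taste!"))   # positive
print(sentiment("Service was slow and the food was bad."))  # negative
```

A production dashboard would swap this lexicon for VADER's and aggregate scores per brand over time.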
KPJ ANALYTICS DASHBOARD
Dashboards: KPJ Operations, KPJ Predictive Analytics
System: internal system (KPJ)
Analysis methods: descriptive; predictive – deep learning, random forest
Process Job: Data Sources → Data Lake in MongoDB

Data flow: Data Sources → Data Consolidation → Data Storage (Data Lake in MongoDB): Staging Database → Create API
• Source SAP HANA – HR (personnel data): tables HR1, HR2, HR3, HR4 → HR data template → API: HR; Procurement: tables EKKO, EKKN, EKPO → API: PROC
• Source SQL Server – Finance / Project Management: IVR, PFR, ACT → Finance data template → API: FINANCE

Notes:
1. Data is prepared by JCORP from the server. An API was created to upload data from the staging DB to MongoDB.
2. The staging database has two processes:
a) Process 1: for HR and FINANCE, the data is cleaned and prepared based on the template given by the UTM Data team.
b) Process 2: for PROC, the data from JCORP is used directly to run the API.
MongoDB (flat file) was created and is ready to be used for descriptive and diagnostic analysis (e.g. in Tableau).
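The staging step described above (clean records against a template, then emit a flat file for the MongoDB upload API) might look roughly like the sketch below. The field names, template and file name are hypothetical, not JCORP's actual schema.

```python
# Clean source-system rows against a template (required fields, renamed
# keys) and write the result as a flat JSON file ready for upload.
import json

TEMPLATE = {"staff_id": "STAFF_ID", "name": "FULL_NAME", "dept": "DEPT"}

def clean(record, template=TEMPLATE):
    """Keep and rename only the fields the template defines."""
    out = {}
    for target, source in template.items():
        if source not in record or record[source] in ("", None):
            return None  # reject incomplete rows
        out[target] = record[source]
    return out

raw = [
    {"STAFF_ID": "H001", "FULL_NAME": "Aminah", "DEPT": "HR", "JUNK": 1},
    {"STAFF_ID": "H002", "FULL_NAME": "", "DEPT": "HR"},  # incomplete
]
staged = [r for r in (clean(rec) for rec in raw) if r is not None]

with open("hr_staged.json", "w") as f:
    json.dump(staged, f)  # flat file for the MongoDB upload API

print(len(staged))  # 1
```

The actual Process 1 would read from the SAP HANA tables (HR1–HR4) and use the UTM Data team's template, but the shape of the job is the same: filter, rename, serialise.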
Process Job: Data Lake (MongoDB) → Data Warehouse / Mart in MySQL Server

MongoDB (flat file) was created and is ready to be used for analysis in Tableau. An API was created to extract data from the data mart into the data warehouse in the MySQL server, to be used especially for predictive and prescriptive analysis.
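A minimal stand-in for the mart-to-warehouse step above, using SQLite in place of a MySQL server so the sketch is self-contained; the table, column names and records are invented.

```python
# Load lake records into a SQL table and run the kind of aggregate
# query a warehouse/mart would serve.
import sqlite3

lake_records = [
    {"dept": "HR", "month": "2024-01", "headcount": 40},
    {"dept": "HR", "month": "2024-02", "headcount": 42},
    {"dept": "Finance", "month": "2024-01", "headcount": 15},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staff_mart (dept TEXT, month TEXT, headcount INT)")
conn.executemany(
    "INSERT INTO staff_mart VALUES (:dept, :month, :headcount)",
    lake_records)

# Warehouse-style aggregate: average headcount per department.
rows = conn.execute(
    "SELECT dept, AVG(headcount) FROM staff_mart "
    "GROUP BY dept ORDER BY dept").fetchall()
print(rows)  # [('Finance', 15.0), ('HR', 41.0)]
```

Against the real MySQL warehouse the extraction API would run the same kind of SQL, with predictive models then trained on the aggregated tables.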
DASHBOARD - FINANCIAL
DASHBOARD – HUMAN RESOURCE
DASHBOARD - PROJECTS
DASHBOARD – KPJ OPERATIONS & PERFORMANCE
DASHBOARD – SOCIAL MEDIA
THANK YOU
In the Name of God for Mankind
www.utm.my