Big Data Analytics - Overview

The document provides an overview of Big Data Analytics, discussing its significance, sources, and the technologies used to manage and analyze large data sets. It highlights the explosive growth of data, the challenges posed by its volume, variety, and velocity, and the emergence of new technologies like NoSQL databases and Apache Hadoop. Additionally, it outlines the skills required for data science and the applications of big data in various sectors.

Overview of Big Data Analytics (MCSD 1053)

DATA SCIENCE GOVERNANCE Framework
OVERVIEW 1: BIG DATA ANALYTICS FRAMEWORK
Contents of the Overview

• What is Big Data? Why Should We Care?
• Who Uses Big Data & How?
• Big Data Skills & Technologies
Part 1

What is Big Data? Why Should We Care?

Copyright © 2015 Andy Koronios and Jing Gao, All Rights Reserved.
Some things are SO BIG that they have implications for EVERYONE!
Big Data is one of those things.
Source: Peppard 2011
Everything, Everywhere: an Intelligent, Instrumented & Interconnected world!

The Internet of Things

More Data, More Often, From More Sources…
Big Data Sources

COMPARATIVE VOLUMES
• ENTERPRISE DATA WAREHOUSE: 1 TB
• KLSE STOCK EXCHANGE: 1 TB/day
• MALAYSIA AIRLINES: 1 TB/min

The Square Kilometre Array (SKA) radio telescope could generate more data per day than the entire Internet.
Data Volumes Are Exploding!

Torrents of data:
• 40% increase per year
• 90% generated in the last 2 years
• Only 5% is structured
Social Media & World Population
[Chart: growth of social media users vs. world population, 2009–2014]
Source: McCrindle Social Media Highlights, 2014
Everything we do is leaving a digital trace.
Laptops and Smartphones Lead Data Traffic Growth
• 92 percent: compound annual growth rate in data traffic from 2010 to 2015.
• 5.6 billion: number of personal devices connected to mobile networks by 2015.
• 1.5 billion: number of machine-to-machine nodes.
• 66 percent: portion of data traffic allocated to video by 2015.
• 159 percent: increase in global mobile data traffic from 2009 to 2010.
• 129 percent: compound annual growth rate of mobile data traffic projected in the Middle East and Africa over 2010 to 2015.
• 248 petabytes: amount of monthly data expected from tablets in 2015. That's more than the entire global mobile network carried in 2010.
• 295 petabytes: amount of mobile data traffic expected to come from machine-to-machine connections in 2015.
• 613 kbps: average smartphone connection speed in 2009.
• 4,404 kbps: average smartphone connection speed in 2015.
The V’s of Big Data
'Big' Data @ Rest & In Motion
• Data @ rest: millions of times more data than a traditional DWH
• Data in motion: thousands of times faster
Data Velocity?

■ The speed at which data is generated
■ The speed at which data is transferred & analyzed
SELF-DRIVING CAR

• Data is analyzed while it is generated, in memory
BIG DATA VARIETY
LIFE-CRITICAL DATA
4 Disruptive Technology Clusters in the 4th Industrial Revolution
Trends of New Business Models
New Value Pools

Digital Capacity Trends
• Information Flow
• Information Stock
• Information Computation
Data Treatment Trends

01 Cheap non-traditional data sources
02 No more random sampling
03 Real-time data
04 Merged data sources
05 Self-learning algorithms
Data Sources Trends

1. DIGITALLY GENERATED: the data are created digitally, can be stored using a series of ones and zeros, and can be manipulated by computers.
2. PASSIVELY PRODUCED: a by-product of our daily lives or interaction with digital services.
3. AUTOMATICALLY COLLECTED: there is a system in place that extracts and stores the relevant data as it is generated.
4. GEOGRAPHICALLY OR TEMPORALLY TRACKABLE: e.g. mobile phone location data or call duration time.
5. CONTINUOUSLY ANALYSED: information is relevant to human well-being and development and can be analysed in real time.
Data Types Trends

1. DATA EXHAUST: passively collected transactional data from people's use of digital services (mobile phones, purchases, web searches) and/or operational metrics and other real-time data collected by agencies to monitor their projects and programmes (stock levels, school attendance).
2. PHYSICAL SENSORS: satellite or infrared imagery of changing landscapes, traffic patterns, light emissions, urban development, topographic changes, etc.; this approach focuses on remote sensing of changes in human activity.
3. ONLINE INFORMATION: web content such as news media and social media interactions (e.g. blogs, Twitter), news articles and obituaries, e-commerce, job postings; this approach treats web usage and content as a sensor of human intent, sentiments, perceptions, and wants.
4. CITIZEN REPORTING OR CROWD-SOURCED DATA: information actively produced or submitted by citizens through mobile phone-based surveys, hotlines, user-generated maps, etc.; while not passively produced, this is a key information source for verification and feedback.
How well can we analyze and use these ever-increasing volumes of data?
So what about 'big data'? It is:

1. A Problem
The Volume, Variety and Velocity of data generated are stressing our IT systems and our ability to handle the data.

2. A Capability
That will allow us to squeeze more value from data.

3. An Opportunity
To optimise processes, enhance decision-making & monetise data through new business models.
Part 2
Who Uses Big Data and How?
Big Data Analytics

Structured or unstructured data -> statistical methods, machine learning, artificial intelligence… -> valuable advanced insights:
• Identify patterns
• Predict & forecast
• Optimization
• Decision making
Example: Talent Scouting?
Data-Driven Decisions
Part 3
What are the Skill Sets and
Technologies underpinning Big Data?

Big Data Technology
• Traditional relational databases could not store and process Big Data at scale.
• As a result, a new class of big data technology has emerged and is being used in many big data analytics environments.

What is Big Data Technology?
• A set of tools or mechanisms that let a computer process data sets that are too big for conventional systems.
High-level Declarative Languages for Writing Queries and Data Analysis

• Pig, from Yahoo / Apache
• JAQL, from IBM
• Hive, from Facebook, etc.
NoSQL Databases & Data Management Tools

• Store and manage data without using Structured Query Language (SQL), relational database schemas, or other common relational database internal operations.
• Non-relational database management systems, used where no fixed schemas are required and data is scaled horizontally.
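Horizontal scaling usually means "sharding": each key is hashed to one of several partitions, so data and load spread across machines. A minimal sketch of the idea in Python, assuming a made-up four-shard layout and a simple hash-mod placement rule (real stores use more elaborate schemes such as consistent hashing):

```python
# Toy sharded key-value store: keys are hashed to one of NUM_SHARDS
# partitions. The shard count and placement rule are illustrative
# assumptions, not any specific database's implementation.
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # each dict stands in for one server

def shard_for(key: str) -> int:
    """Map a key deterministically to a shard index."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Aisyah"})
put("user:99", {"name": "Hafiz"})
```

Because `shard_for` is deterministic, any client can locate a key without a central lookup; adding capacity means adding shards and re-partitioning.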
Categories of NoSQL Databases

• KEY-VALUE PAIR (e.g. Cassandra): keys are used to retrieve values from opaque data blocks; essentially a hash map; tremendously fast.
• DOCUMENT DATABASE (e.g. MongoDB and CouchDB): again a key-value store, but the value is a document; documents are not of fixed schemas and can be nested; queries can be based on content as well as keys; use case: blogging websites.
• COLUMNAR DATABASE (e.g. Microsoft Columnstore, SAP HANA): works on attributes rather than tuples; the key is the column name and the value is the contiguous column values; best for aggregation queries of the form "select one or two columns' values where some column value equals some value".
• GRAPH DATABASE (e.g. Neo4j and Giraph): a collection of nodes and edges; nodes represent data while edges represent the links between them; the most dynamic and flexible category.
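The columnar category is the least intuitive of the four, so here is a small sketch contrasting row (tuple) storage with column storage for an aggregation query. The table, column names, and values are invented for illustration:

```python
# Row store: one record (tuple) per entry.
rows = [
    {"region": "North", "sales": 100},
    {"region": "South", "sales": 250},
    {"region": "North", "sales": 150},
]

# Columnar store: the key is the column name, the value is the
# contiguous run of that column's values (as described above).
columns = {
    "region": ["North", "South", "North"],
    "sales": [100, 250, 150],
}

# An aggregation touches only the one column it needs...
total_sales = sum(columns["sales"])

# ...while the row-store equivalent must visit every whole tuple.
total_sales_rows = sum(r["sales"] for r in rows)
```

On disk, scanning one contiguous column is far cheaper than reading every tuple, which is why columnar stores win for aggregation-heavy workloads.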
Apache Hadoop

• Open-source software framework
• Distributed, scalable system for large data sets on commodity hardware
• Top level written in Java
• Architecture:
  • File system: Hadoop Distributed File System (HDFS)
  • Processing (programming model): MapReduce
• Major users: Facebook, Yahoo, Amazon.com, Microsoft, etc.
Apache Hadoop Ecosystem
Hadoop Distributed File System (HDFS)

[Diagram: a Client talks to the NameNode (NN); file blocks live on DataNodes (DN)]

• HDFS stores data in a distributed, scalable and fault-tolerant way
• The NameNode (NN) holds metadata about the data on the DataNodes (DN)
• The DNs actually hold the data, in the form of blocks, and can communicate with one another
• Data is stored as (optionally compressed) files spread across n commodity servers
• Data is stored as files split into blocks, not as relational tables and columns
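The NameNode/DataNode split above can be modelled in a few lines. This is a toy sketch, with a tiny block size and a round-robin placement rule standing in for HDFS's rack-aware policy (real HDFS defaults are 128 MB blocks with a replication factor of 3):

```python
# Toy model of HDFS: a file is split into fixed-size blocks, each block
# is replicated on several DataNodes, and the NameNode keeps only the
# block -> DataNodes mapping (metadata), never the data itself.
BLOCK_SIZE = 8            # bytes; tiny on purpose for demonstration
REPLICATION = 3
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data: bytes):
    """Chop a file into BLOCK_SIZE chunks, as HDFS does on write."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place_blocks(blocks):
    """NameNode-style metadata: block index -> DataNodes holding a replica."""
    placement = {}
    for i, _ in enumerate(blocks):
        # round-robin placement, a stand-in for HDFS's rack-aware policy
        placement[i] = [DATANODES[(i + r) % len(DATANODES)]
                        for r in range(REPLICATION)]
    return placement

blocks = split_into_blocks(b"hello big data world!")
metadata = place_blocks(blocks)
```

A client reading the file asks the NameNode for `metadata`, then fetches each block directly from any DataNode that holds a replica; losing one node loses no data.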
MapReduce

[Diagram: the Client submits a job to the JobTracker (JT), which coordinates TaskTrackers (TT) co-located with the DataNodes]

• Mappers extract data from HDFS and emit it as key-value maps
• Reducers aggregate the results produced by the mappers
• The Job Tracker (JT) is the server component:
  • finds how many blocks the data occupies
  • contacts the NN
  • sends the program to the data nodes
• The Task Tracker (TT) is the slave component:
  • completes its part of the process on its DN
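The map -> shuffle -> reduce flow above can be sketched as a single-process word count in plain Python. In a real cluster the mapper and reducer calls would run in parallel on the TaskTrackers; only the function shapes are faithful here:

```python
# Word count in the MapReduce model: mappers emit (key, value) pairs,
# a shuffle groups pairs by key, reducers aggregate each group.
from collections import defaultdict

def mapper(line: str):
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle phase: group all values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce phase: collapse one key's values into a single result."""
    return (key, sum(values))

lines = ["Big Data", "big data analytics", "data"]
pairs = [p for line in lines for p in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
# counts == {"big": 2, "data": 3, "analytics": 1}
```

The key property: mappers and reducers are pure functions over their own slice of data, so the framework can scatter them across any number of nodes.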
The Apache Hadoop Family

Name: Description
Hadoop Common: common utilities.
HDFS: distributed file system.
YARN: job scheduling and cluster resource management.
MapReduce: parallel processing of large data sets.
Chukwa: a data collection system for large distributed systems.
HBase: scalable, distributed database supporting structured data storage.
Hive: a data warehouse infrastructure providing SQL-like, ad hoc querying; can also cater for unstructured data.
Mahout: scalable machine learning & data mining library (e.g. k-means for data clustering, random forest and logistic regression for data classification). Widely used to develop recommender systems for online businesses.
The Apache Hadoop Family (cont.)

Name: Description
Pig: high-level, procedural, data-flow language to process data, speed up coding and make it handier. Can extract, transform and load data (ETL).
Zookeeper: high-performance coordination service for distributed applications.
Flume: responsible for collecting, aggregating and moving data into HDFS.
Sqoop ("SQL to Hadoop"): transfers data between Hadoop clusters and relational databases (such as Oracle or Microsoft SQL Server) that traditionally use SQL instructions.
Kerberos: provides authentication services in Hadoop clusters.
Serengeti: virtualization tool that helps build virtual Hadoop clusters in the cloud.
Spark: processing engine that performs at speeds up to 100 times faster than MapReduce for iterative algorithms or interactive data mining. Provides in-memory cluster computing for speed; supports Java, Scala, and Python APIs; combines SQL, streaming and complex analytics. Runs on Hadoop, Mesos, standalone, or in the cloud; can access diverse data sources such as HDFS, Cassandra, HBase, or S3.
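Why does in-memory computing help iterative algorithms so much, as the Spark entry claims? A MapReduce-style job re-reads its input from storage on every iteration; a Spark-style job loads it once and caches it. The sketch below counts storage reads in plain Python; it illustrates the idea only, not the Spark API:

```python
# Count how often each style touches slow storage during 5 iterations
# of the same computation. load_dataset() stands in for an HDFS read.
reads_from_storage = {"count": 0}

def load_dataset():
    reads_from_storage["count"] += 1      # simulated slow disk/HDFS read
    return list(range(10))

# MapReduce style: every iteration goes back to storage.
for _ in range(5):
    data = load_dataset()
    result = sum(x * x for x in data)
mapreduce_reads = reads_from_storage["count"]

# Spark style: load once, cache in memory, iterate on the cached copy
# (analogous to calling rdd.cache() before an iterative algorithm).
reads_from_storage["count"] = 0
cached = load_dataset()
for _ in range(5):
    result = sum(x * x for x in cached)
spark_reads = reads_from_storage["count"]
```

Five storage reads versus one; with real disk latencies and many more iterations, that gap is where much of the claimed speedup comes from.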
BIG DATA ARCHITECTURE

[Diagram: end-to-end architecture in six stages: Data Sources -> Data Consolidation -> Data Storage -> Data Provisioning -> Data Discovery -> Applications]

• Data Sources (structured & unstructured): internal systems, operational systems, web data (social media, news, forums, open/public data), mixed media, machine data, spatial-temporal data (maps, land), external data (weather, commercial, stock data), data streams.
• Data Consolidation (jConnect): extract, transform, integrate and incrementally load new data; data dictionaries & data model; business rules & dictionaries; security & access control; integrity checks.
• Data Storage: staging database (shards 1…n), Data Lake, data marts (1…n), Data Warehouse, BigchainDB (future), replication and archives.
• Data Provisioning: cleansing (jClean), self-service data marts, dynamic extract, Smart Data Lake (future), automated BI.
• Data Discovery (jView Analytics): descriptive, diagnostic, predictive and prescriptive analytics; dashboards; alerts; data visualization; OLAP analysis; autonomous report generation; routine reports.
• Applications & Users: analytics services, web apps, smart apps, mobile apps and tools serving clients, business users, business analysts, data scientists and knowledge workers.
JCORP DATA ADVISORY: MongoDB Data Lake Design

[Diagram: three MongoDB collections feed the management dashboard]

• COLLECTION: HR (from SAP HANA DB, internal server): 1. HR data; 2. Intrapreneur data
• COLLECTION: FINANCE (from SQL Server DB): 1. Finance data; 2. Procurement data (via stored procedure)
• COLLECTION: NEWS & SOCIAL MEDIA (from SQL Server DB, semi-private server): 1. Projects data
MANAGEMENT DASHBOARD: DATA SOURCES

Dashboard Info: Finance, Intrapreneur
System: Periodic Financial Reporting (FRP) & Account Consolidation
Database: SQL Server
Data Extraction: system report -> Excel sheet -> export to shared folder; shared folder -> script -> MongoDB
Data Frequency: Monthly
Data Captured / Relationship:
1. Closing Account (Revenue, Income Statement, Dimension, Disclosure, Corporate Info)
2. Flows
3. Investment
4. Intercompany
5. Partner
6. Dimension
MANAGEMENT DASHBOARD: DATA SOURCES (cont.)

Dashboard Info: HR, Procurement
System: SAP
Database: SAP HANA
Data Extraction: SAP ad hoc query -> Excel sheet -> export to shared folder; shared folder -> script -> MongoDB
Data Frequency: Monthly
Data Captured / Relationship:
1. HR Info (New Staff, Active Staff, Resigned Staff, Payroll by Department)
2. Material Management (Purchasing)

Dashboard Info: Projects
System: UI Template / Excel (internal access)
Database: SQL Server
Data Extraction: direct DB connection
Data Frequency: Monthly
Data Captured / Relationship:
1. Projects (Project Info, Project Timeline)
SOCIAL MEDIA DASHBOARD

Dashboard Info: JCORP, KPJ, QSR (KFC & Pizza Hut)
System: in-house development
Analysis / Method:
1. Sentiment analysis: VADER Sentiment
2. Social media engagement: Word2Vec
3. Related news: machine learning
4. Related word cloud

KPJ ANALYTICS DASHBOARD

Dashboard Info: KPJ Operations, KPJ Predictive Analytics
System: internal system (KPJ)
Analysis / Method: descriptive analytics; predictive analytics using deep learning and random forest
Process Job: Data Source to Data Lake in MongoDB

[Diagram: Data Sources -> Data Consolidation (staging database) -> Data Storage (Data Lake in MongoDB, exposed via APIs)]

• Source SAP HANA (HR personnel data, Procurement): staging tables HR1–HR4 feed the HR data template and the HR API; tables EKKO, EKKN and EKPO feed the PROC API.
• Source SQL Server (Finance, Project Management): tables IVR, PFR and ACT feed the Finance data template and the FINANCE API.

Notes:
1. Data is prepared by JCORP from its servers. An API was created to upload data from the staging DB to MongoDB; the MongoDB (flat file) collections are then ready for descriptive and diagnostic analysis (e.g. in Tableau).
2. The staging database has two processes:
   a) PROCESS 1: for HR and FINANCE, the data is cleaned and prepared based on the template given by the UTM data team.
   b) PROCESS 2: for PROC, the data from JCORP is used directly to run the API.
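PROCESS 1 above (clean and reshape raw exports to a fixed template before loading the lake) can be sketched as follows. The template fields, record values, and rejection rule are hypothetical stand-ins, not the actual UTM template or JCORP schema:

```python
# Sketch of a staging step: raw exported rows are normalised to a fixed
# template and emitted as flat documents ready for a MongoDB insert_many().
# TEMPLATE_FIELDS and the sample rows are invented for illustration.
import json

TEMPLATE_FIELDS = ["staff_id", "department", "payroll_month", "amount"]

def clean_record(raw):
    """Normalise one exported row; drop rows missing the required key."""
    doc = {f: raw.get(f) for f in TEMPLATE_FIELDS}
    if doc["staff_id"] is None:
        return None                      # reject incomplete rows
    doc["amount"] = float(doc["amount"] or 0)   # coerce spreadsheet strings
    return doc

raw_rows = [
    {"staff_id": "A100", "department": "HR",
     "payroll_month": "2024-01", "amount": "5200"},
    {"department": "Finance", "amount": "9999"},   # no staff_id: dropped
]
documents = [d for d in (clean_record(r) for r in raw_rows) if d]
payload = json.dumps(documents)          # flat documents for the upload API
```

Keeping the documents flat (no nesting) is what makes them directly usable by Tableau-style descriptive tools downstream, as the notes describe.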
Process Job: Data Lake (MongoDB) to Data Warehouse / Mart in MySQL Server

• The MongoDB (flat file) collections are ready to be used for analysis in Tableau.
• An API was created to extract data from the data marts into a data warehouse in MySQL Server.
• The data warehouse is used especially for predictive and prescriptive analysis.
DASHBOARD - FINANCIAL
DASHBOARD – HUMAN RESOURCE
DASHBOARD - PROJECTS
DASHBOARD – KPJ OPERATIONS & PERFORMANCE
DASHBOARD – SOCIAL MEDIA
THANK YOU

In the Name of God for Mankind


www.utm.my
