Big Data
Associate Professor
Department of Computer Science and Engineering
Session Objectives
• Cost Reductions
• Time Reductions
• New product development and optimized offerings
• Smart Decision Making
Big Data
• A term for any collection of large and complex data sets.
• It is difficult to process using database management tools or traditional
data processing applications.
• The challenges include capture, refining, storage, search, sharing, transfer,
analysis and visualization.
• Most companies collect ‘millions’ of data items. Many more are available
via Google, Facebook, Twitter, Amazon, etc.
• These data are seldom structured.
• Many companies use “Big Data” for manual queries (marketing and sales), to answer research questions, etc.
• It is still not common to utilize Big Data automatically and systematically within an algorithmic (forecasting) framework.
• We argue that such use will contribute to both the analysis and the forecasting.
Big Data Characteristics: The 7 Vs
Volume:
-Big data implies enormous volumes of data
-How much data is really relevant to the problem solution?
-Cost of processing?
-So, can you really afford to store and process all that data?
• Data Volume
– 44x increase from 2009 to 2020
– from 0.8 zettabytes (ZB) to 35 ZB (a quick check of these figures follows below)
• Data volume is increasing exponentially
[Figure: exponential increase in collected/generated data]
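A quick back-of-the-envelope check of the volume figures above; the 0.8 ZB and 35 ZB values are taken from the slide, and the ~41% annual growth rate is derived from them (a minimal Python sketch):

# Sanity-check the data-volume figures quoted above.
volume_2009_zb = 0.8    # zettabytes in 2009 (from the slide)
volume_2020_zb = 35.0   # zettabytes projected for 2020 (from the slide)

growth_factor = volume_2020_zb / volume_2009_zb            # ~43.8, i.e. the "44x" claim
annual_rate = growth_factor ** (1 / (2020 - 2009)) - 1     # compound annual growth rate

print(f"growth factor: {growth_factor:.1f}x")   # ~43.8x
print(f"implied CAGR:  {annual_rate:.1%}")      # ~41% per year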
Volume: Example
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• ? TB of data every day
• 30 billion RFID tags today (1.3 billion in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009; 200 million by 2014
• 2+ billion people on the Web by the end of 2011
Variety:
-Variety refers to the many sources and types of data, both structured and unstructured.
-A small fraction is in structured formats: relational tables, XML, etc.
-A fair amount is semi-structured, such as web logs (see the log-parsing sketch below).
-The rest of the data is unstructured: text, photographs, etc.
-So, no single data model can currently handle this diversity.
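To make the structured/semi-structured distinction concrete, here is a minimal log-parsing sketch; the log line and field names are invented for illustration, not taken from the slides. It pulls structured fields out of a semi-structured web-server log entry, something a fixed relational schema would not handle directly:

import re

# One web-server access-log line (Common Log Format): semi-structured --
# there is a recognizable pattern, but no schema enforced by a database.
line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

match = pattern.match(line)
if match:
    record = match.groupdict()               # now a structured dict: ip, time, method, ...
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    print(record)

Unstructured data (free text, photographs) needs entirely different tooling, which is why no single data model covers the whole variety.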
Different Types of Data
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
• You can only scan the data once (see the single-pass sketch after this list)
• A single application can be generating/collecting many types
of data
• Big Public Data (online, weather, finance, etc)
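Because a stream can, in general, only be scanned once, any statistic has to be maintained incrementally as elements arrive. A minimal single-pass sketch; the event_stream generator is just a stand-in for a real data source:

import random

def event_stream(n=1_000_000):
    """Stand-in for a real stream (sensor readings, click sizes, ...)."""
    for _ in range(n):
        yield random.random() * 100

# Single-pass running statistics: each element is seen exactly once.
count, total, maximum = 0, 0.0, float("-inf")
for value in event_stream():
    count += 1
    total += value
    maximum = max(maximum, value)

print(f"count={count}, mean={total / count:.2f}, max={maximum:.2f}")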
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Value:
- How much value is created for each unit of data (whatever it is)?
- So, what is the contribution of subsets of the data to the problem solution?
Velocity:
-Real-time analytics/decision requirement: data arrives continuously and must be processed fast enough to act on.
Volatility:
-Volatility refers to how long data remains valid and how long it should be stored.
-For real-time data, you need to determine at what point data is no longer relevant to the current analysis.
-How long should the data be kept? (A retention-window sketch follows below.)
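One common way to act on volatility is a retention window: records older than the window are dropped from the working set. A small sketch; the 30-day window and the record layout are assumptions for illustration, not values from the slides:

from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)   # assumed retention window, purely illustrative

def still_relevant(record_time, now=None):
    """True while a record is still inside the retention window."""
    now = now or datetime.now(timezone.utc)
    return now - record_time <= RETENTION

# Keep only records that are still relevant to the current analysis.
records = [
    {"id": 1, "ts": datetime.now(timezone.utc) - timedelta(days=3)},
    {"id": 2, "ts": datetime.now(timezone.utc) - timedelta(days=90)},
]
fresh = [r for r in records if still_relevant(r["ts"])]
print([r["id"] for r in fresh])   # -> [1]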
TRADITIONAL BIG DATA ARCHITECTURE
STREAMING BIG DATA ARCHITECTURE
LAMBDA BIG DATA ARCHITECTURE
KAPPA BIG DATA ARCHITECTURE
Limitations of Big Data
Prioritizing correlations
• Data analysts use big data to tease out correlations: cases where one variable is linked to another.
• However, not all of these correlations are substantial or meaningful.
• More specifically, just because two variables are correlated doesn’t mean that a causal relationship exists between them (see the sketch below).
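This can be demonstrated directly: two completely independent random walks often show a large correlation coefficient even though neither influences the other. A small sketch, assuming numpy is available:

import numpy as np

rng = np.random.default_rng(seed=0)

# Two independent random walks: by construction there is no real relationship.
x = np.cumsum(rng.normal(size=1000))
y = np.cumsum(rng.normal(size=1000))

r = np.corrcoef(x, y)[0, 1]
print(f"correlation between unrelated series: r = {r:.2f}")
# |r| is frequently large for such series, yet x tells us nothing about y.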
Security
• As with many technological endeavors, big data analytics is prone to data breaches.
• The information that you provide to a third party could get leaked to customers or competitors.
Transferability
• Because much of the data you need analyzed lies behind a firewall or on a
private cloud, it takes technical know-how to efficiently get this data to
an analytics team.
• Furthermore, it may be difficult to consistently transfer data to specialists
for repeat analysis.