BIG
DATA
Basics
The BIGGER picture...
BIG DATA
PRESENTED BY: RONICA GUPTA, 16104048, [email protected]
SUBMITTED TO: PROF. ALKA JINDAL, COMPUTER SCIENCE DEPARTMENT
How big is BIG DATA?
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The Depth of BIG DATA
● Walmart handles more than 1 million customer transactions every hour.
● Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
● 230+ million tweets are created every day.
● More than 5 billion people are calling, texting, tweeting and browsing on
mobile phones worldwide.
● YouTube users upload 48 hours of new video every minute of the day.
● Amazon handles 15 million customer clickstream records per day to recommend products.
● 294 billion emails are sent every day. Email services analyse this data to detect spam.
● Modern cars have close to 100 sensors that monitor fuel level, tire pressure, etc., so each vehicle generates a large amount of sensor data.
BIG DATA Characteristics
Volume
● The name ‘Big Data’ itself refers to a size that is enormous.
● Volume means the sheer amount of data.
● The size of data plays a crucial role in determining its value: only when the volume of data is very large is it actually considered ‘Big Data’.
● Hence, while dealing with Big Data it is necessary to consider the characteristic ‘Volume’.
Variety
● Structured: stored and processed in a fixed format; easy to process.
● Semi-Structured: has some organizational properties like tags and markers.
● Unstructured: unknown form; can’t be analysed unless transformed.
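As a small illustration of the three forms, the Python sketch below holds one record of each kind; the records themselves are invented for the example.

```python
import json

# Structured: fixed schema, e.g. a row in a relational table
structured_row = ("C-1001", "North", 49.99)          # (order_id, region, amount)

# Semi-structured: organizational markers (keys/tags) but a flexible schema
semi_structured = json.loads('{"order_id": "C-1001", "tags": ["online", "promo"]}')

# Unstructured: free text with no predefined form; must be transformed before analysis
unstructured = "Loved the product, but delivery took far too long!"

print(structured_row[2], semi_structured["tags"], len(unstructured.split()))
```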
Velocity
● Velocity refers to the high speed at which data accumulates.
● In Big Data, data flows in at high velocity from sources like machines, networks, social media, mobile phones, etc.
● There is a massive and continuous flow of data. Velocity determines how fast data is generated and how fast it must be processed to meet demand.
● Sampling the data can help in dealing with velocity, as the sketch below illustrates.
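A common way to cope with velocity is to sample the stream. Below is a minimal Python sketch of reservoir sampling, which keeps a fixed-size uniform sample no matter how fast records arrive; the stream here is simulated.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)   # replace an existing item with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Simulated high-velocity stream of one million sensor readings
readings = (random.gauss(50, 5) for _ in range(1_000_000))
print(reservoir_sample(readings, 10))
```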
Veracity
● Veracity refers to inconsistencies and uncertainty in data: the data that is available can be messy, and its quality and accuracy are difficult to control.
● Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
Value
● Data in itself is of no use or importance; it needs to be converted into something valuable by extracting information from it.
● Value denotes the added value Big Data brings to companies. Many companies have recently established their own data platforms, filled their data pools, and invested heavily in infrastructure.
Accessing BIG DATA
● DATA MINING: Data mining is the process of discovering insights within a database.
● DATA CLEANING: Data sets come in all shapes and sizes. Before one can even think about how the data will be stored, it needs to be in an acceptable format.
● DATA STORAGE: The major difficulty with Big Data is managing how it will be stored.
● DATA ANALYSIS: Once all the data has been collected, it needs to be analysed to look for interesting patterns and trends.
● DATA VISUALIZATION: This is the part that takes all the work done prior and outputs a visualisation that ideally anyone can understand.
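A toy pass over the cleaning, analysis and visualization steps could look like the pandas sketch below; the input file, column names and chart are all hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Data cleaning: load a hypothetical transactions file and drop bad rows
df = pd.read_csv("transactions.csv")                 # hypothetical input file
df = df.dropna(subset=["customer_id", "amount"])
df = df[df["amount"] > 0]

# Data analysis: look for a simple pattern - total spend per customer
spend = df.groupby("customer_id")["amount"].sum().sort_values(ascending=False)

# Data visualization: a chart anyone can read
spend.head(10).plot(kind="bar", title="Top 10 customers by spend")
plt.tight_layout()
plt.savefig("top_customers.png")
```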
BIG DATA Tools
BIG DATA Applications
BIG DATA Challenges
THANK YOU!
ANY QUESTIONS?
What happened that
made data
“BIG”
A presentation on BIG DATA for beginners
BIG DATA
Submitted by: Abhinav Chadha, 16104122, [email protected]
Submitted to: Prof. Alka Jindal, Computer Science Department
What actually is
BIG DATA?
BIG DATA refers to huge volumes of data that cannot be stored and processed using traditional database and software techniques.
How huge does data need to be?
100 MB? It’s all relative!
Why study BIG
DATA?
● 48 hours of video is uploaded to YouTube every minute.
● 90% of all the data in the world was created in the past two years.
● 322 PB worth of image files are sent every hour over WhatsApp.
How BIG DATA works?
1. Integrate: During integration, you need to bring in the data, process it, and make sure it’s formatted and available in a form that your business analysts can get started with.
2. Manage: You can store your data in any form you want and bring your desired processing requirements and necessary process engines to those data sets on an on-demand basis.
3. Analyze: Build data models with machine learning and artificial intelligence. Put your data to work.
BIG DATA a concept!
BIG DATA ANALYTICS
● Used by companies to facilitate their growth and development.
● Involves applying various data mining algorithms to a given set of data.
● Aids them in better decision making.
Why is analysis of BIG DATA important?
How BIG DATA works?
Three Stages in
BIG DATA Analytics
Descriptive
Analysis
This analysis performs an in-depth examination of historical data to reveal details such as the underlying reasons for failures.
Descriptive analytics answers the question:
‘What happened in the business?’
Predictive
Analysis
In predictive analytics, machine learning and statistical tools are used to predict the future. Predictive analytics analyses previous reports to answer the question:
‘What could happen?’
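As a toy illustration of that idea (not any real business model), the sketch below fits a simple linear regression to invented monthly sales figures and projects the next month.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up historical reports: sales for months 1..6
months = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([110, 118, 131, 140, 152, 161])

model = LinearRegression().fit(months, sales)   # learn the trend from past reports
print("Forecast for month 7:", model.predict([[7]])[0])
```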
Prescriptive
Analysis
This analysis helps minimize or maximize outcomes, for example lowering the manufacturing cost or reframing marketing policies.
Prescriptive analytics answers the question:
‘What should we do?’
It’s not the end,
It's the beginning!
Feel free to ask
questions
Technologies and Applications of Big Data
Submitted by: Soumy Jain
[email protected]
Big Data Technology can be defined as a software utility that is designed to analyse, process and extract information from extremely complex and large data sets which traditional data processing software could never deal with.
Big Data Technology is mainly classified into two types:
Operational Big Data Technologies
Analytical Big Data Technologies
Firstly, Operational Big Data is all about the normal day-to-day data that we generate. This could be online transactions, social media, or the data from a particular organisation. One can even consider this to be a kind of raw data which is used to feed the Analytical Big Data Technologies.
Big Data Technologies
Big Data Technologies in Data Storage.
Hadoop
The Hadoop framework was designed to store and process data in a distributed data processing environment with commodity hardware, using a simple programming model. It can store and analyse data present in different machines with high speed and at low cost.
Developed by: Apache Software Foundation, 10 December 2011.
Written in: JAVA
Current stable version: Hadoop 3.11
MongoDB
NoSQL document databases like MongoDB offer a direct alternative to the rigid schemas of relational databases. This allows MongoDB to offer flexibility while handling a wide variety of data types, at large volumes and across distributed architectures.
Developed by: MongoDB, 11 February 2009.
Written in: C++, Go, JavaScript, Python
Current stable version: MongoDB 4.0.10
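A minimal sketch of that flexibility using the pymongo driver; it assumes a MongoDB server running on localhost, and the database, collection and documents are made up.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
tweets = client["bigdata_demo"]["tweets"]        # hypothetical database and collection

# Documents in the same collection do not need identical fields (flexible schema)
tweets.insert_one({"user": "alice", "text": "big data!", "likes": 3})
tweets.insert_one({"user": "bob", "text": "hello", "hashtags": ["#bigdata"]})

# Query by field; no table schema is required
for doc in tweets.find({"user": "alice"}):
    print(doc["text"])
```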
Rainstor
RainStor is a software company that developed a database management system of the same name, designed to manage and analyse Big Data for large enterprises. It uses deduplication techniques to organize the process of storing large amounts of data for reference.
Developed by: RainStor Software company in the year 2004.
Works like: SQL
Current stable version: RainStor 5.5
Hunk
Hunk lets you access data in remote Hadoop clusters through virtual indexes and lets you use the Splunk Search Processing Language to analyse your data. With Hunk, you can report on and visualize huge amounts of data from your Hadoop and NoSQL data sources.
Developed by: Splunk INC in the year 2013.
Written in: JAVA
Current stable version: Splunk Hunk 6.2
Components and ecosystem of Big Data Technologies
Business intelligence
Cloud computing
Databases
Techniques for analyzing Data
A/B testing (a minimal sketch follows this list)
Machine learning
Natural language processing
Visualization
Charts
Graphs
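Of the techniques listed above, A/B testing is the simplest to sketch. The snippet below compares the conversion rates of two page variants with a two-proportion z-test; the visitor and conversion counts are invented.

```python
from math import sqrt
from statistics import NormalDist

# Invented results: (conversions, visitors) for variants A and B
conv_a, n_a = 200, 5000
conv_b, n_b = 260, 5000

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided test

print(f"lift = {p_b - p_a:.3%}, z = {z:.2f}, p = {p_value:.4f}")
```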
Multidimensional big data can also be represented as data cubes or, mathematically, tensors.
Array Database Systems have set out to provide storage and high-level query support on this data type.
Additional technologies being applied to big data include tensor-based computation such as multilinear subspace learning, massively parallel-processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed cache (e.g., burst buffer and Memcached), distributed databases, and cloud and HPC-based infrastructure (applications, storage and computing resources).
Although many approaches and technologies have been developed, it still remains difficult to carry out machine learning with big data.
Applications
Government
The use and adoption of big data within governmental processes allows efficiencies in terms of cost, productivity, and innovation.
CRVS (Civil Registration and Vital Statistics) collects all certificate statuses from birth to death and is a source of big data for governments.
Manufacturing
Big data provides an infrastructure for transparency in the manufacturing industry, that is, the ability to unravel uncertainties such as inconsistent component performance and availability.
Predictive manufacturing, as an applicable approach toward near-zero downtime, requires a vast amount of data and advanced prediction tools for a systematic process of turning data into useful information.
Healthcare
Big data analytics has helped healthcare improve by providing personalized medicine and prescriptive analytics, clinical risk intervention and predictive analytics, waste and care variability reduction, automated external and internal reporting of patient data, standardized medical terms, and patient registries and fragmented point solutions.
Human inspection at the big data scale is impossible, and there is a desperate need for intelligent tools for accuracy and believability control and handling of information.
Education
Private bootcamps have developed programs to meet big data demand, such as The Data Incubator and General Assembly.
Media
Publishing environments are increasingly tailoring messages (advertisements) and content to appeal to consumers, drawing on information that has been gleaned through various data-mining activities.
• Targeting of consumers (for advertising by marketers)
• Data capture
• Data journalism: publishers and journalists use big data tools to provide unique insights and infographics.
Internet of Things (IoT)
Big data and the IoT work in conjunction. Data extracted from IoT devices provides a mapping of device interconnectivity. Such mappings have been used by the media industry and governments to more accurately target their audience and increase media efficiency. The IoT is also increasingly adopted as a means of gathering sensory data, and this sensory data has been used in medical, manufacturing and transportation contexts.
How is Big Data stored and
processed?
Aniket Tiwari
16104093
[email protected]
How is Big Data stored and processed?
There are two approaches for storing and processing
Big Data:-
• Traditional approach
• Modern approach
Traditional Approach
• The data that is being generated is given as an input
to the ETL System.
• An ETL system would then extract this data and transform it.
• Now the end users can generate reports and perform
analytics, by querying this data.
• But as the data grows, it becomes a very challenging
task to manage and process this data.
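A toy version of that extract-transform-load-and-query loop, using only the Python standard library; the file, table and column names are made up.

```python
import csv
import sqlite3

# Extract: read raw sales records from a hypothetical CSV export
with open("sales_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: normalise fields and drop malformed records
clean = [(r["order_id"], r["region"].strip().upper(), float(r["amount"]))
         for r in rows if r.get("amount")]

# Load: put the cleaned data into a relational store for analysts
db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
db.commit()

# Report: end users query the warehouse
for region, total in db.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)
```

As the slide notes, this single-machine loop stops scaling once the data outgrows one database server, which motivates the modern approach below.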
Drawbacks of Traditional Approach
• It is an expensive system
• No scalability
• It is time consuming
Modern Approach
• Hadoop is used to store and process a huge volume of data efficiently.
• Hadoop has two components – HDFS and
MapReduce.
• HDFS takes care of storing and managing the data
within the Hadoop Cluster.
• MapReduce takes care of processing and computing the data that is present within the HDFS.
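As a minimal sketch of the MapReduce idea, here is the classic word count (an illustration, not anything from these slides) written in the Hadoop Streaming style, where the mapper and reducer are small scripts that read stdin and write stdout.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map: emit (word, 1) for every word in the input split."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce: sum the counts for each word (pairs are grouped by key)."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Locally this chains map and reduce directly; on a cluster Hadoop runs them
    # on different nodes and handles the shuffle/sort in between.
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```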
Hadoop Cluster
• A Hadoop cluster comprises a Master Node, Slave Nodes and a Secondary Node.
How does Hadoop work?
• HDFS divides data into chunks
• Each part of data is stored into a separate
node
• Each part is replicated on other nodes to increase availability and decrease latency (see the toy sketch below)
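The toy sketch below mimics those two ideas by splitting a byte string into fixed-size blocks and assigning each block to several pretend nodes; the block size and replication factor are illustrative and do not reflect real HDFS defaults.

```python
import itertools

def place_blocks(data: bytes, block_size: int, nodes: list, replication: int):
    """Split data into blocks and replicate each block on `replication` nodes."""
    placement = {}
    node_cycle = itertools.cycle(range(len(nodes)))
    for block_id, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        # pick `replication` nodes round-robin (distinct while replication <= len(nodes))
        chosen = [nodes[next(node_cycle)] for _ in range(replication)]
        placement[block_id] = (len(block), chosen)
    return placement

nodes = ["node1", "node2", "node3", "node4"]
for block, (size, where) in place_blocks(b"x" * 1000, 256, nodes, replication=3).items():
    print(f"block {block}: {size} bytes on {where}")
```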
Features provided by Hadoop
• Cost effective
• Cluster of Nodes
• Parallel processing
• Distributed data
• Automatic Fail Over Management
• Data locality optimization
• Heterogeneous Cluster
• Scalability
APACHE PIG
Vandit Goel
16104034
[email protected]
INTRODUCTION
TO PIG
Introduction
● Apache Pig is a platform for data analysis. It is an alternative to
MapReduce Programming.
● Pig was developed as a research project at Yahoo in 2006.
Why Pig?
● Writing mappers and reducers by hand takes a long time.
● Pig introduces Pig Latin, a scripting language that lets you use SQL-like
syntax to define your map and reduce steps.
● Highly extensible with user-defined functions (UDFs).
Pig Architecture
Pig Architecture
1. Parser
At first, all the Pig Scripts are handled by the Parser. Parser basically checks
the syntax of the script, does type checking, and other miscellaneous checks.
Afterwards, Parser’s output will be a DAG (directed acyclic graph) that
represents the Pig Latin statements as well as logical operators.
The logical operators of the script are represented as the nodes and the data
flows are represented as edges in DAG (the logical plan).
Pig Architecture
2. Optimizer
Afterwards, the logical plan (DAG) is passed to the logical optimizer. It carries
out the logical optimizations.
3. Compiler
Then compiler compiles the optimized logical plan into a series of MapReduce
jobs.
Pig Architecture
4. Execution engine
Eventually, all the MapReduce jobs are submitted to Hadoop in a sorted order.
Ultimately, it produces the desired results while these MapReduce jobs are
executed on Hadoop.
Working of Pig
How Twitter used Apache Pig to analyse their large data set?
Twitter had both semi-structured data like Twitter Apache logs, Twitter search
logs, Twitter MySQL query logs, application logs and structured data like
tweets, users, block notifications, phones, favorites, saved searches, re-tweets,
authentications, SMS usage, user followings, etc. which can be easily
processed by Apache Pig.
Twitter dumps all its archived data on HDFS. It has two tables i.e. user data and
tweets data. User data contains information about the users like username,
followers, followings, number of tweets etc. While Tweet data contains tweet,
its owner, number of re-tweets, number of likes etc. Now, Twitter uses this data to analyse their customers’ behaviour and improve their experience.
How Twitter used Apache Pig to analyse their large data set?
The step-by-step solution of this problem is as follows.
STEP 1 – First of all, Twitter imports the Twitter tables (i.e. the user table and the tweet table) into the HDFS.
STEP 2 – Then Apache Pig loads (LOAD) the tables into the Apache Pig framework.
STEP 3 – Then it joins and groups the tweet table and user table using the GROUP command.
How Twitter used Apache Pig to analyse their large data set?
STEP 4 – Then the tweets are counted according to the users using the COUNT command, so that the total number of tweets per user can be easily calculated.
STEP 5 – At last, the result is joined with the user table to extract the user name along with the produced result.
STEP 6 – Finally, this result is stored back in the HDFS.
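The actual Pig Latin script is not reproduced on these slides, so as an illustrative stand-in the pandas sketch below runs the same load-group-count-join-store flow on two hypothetical tables.

```python
import pandas as pd

# STEP 1-2: load the two (hypothetical) tables exported from HDFS
users = pd.read_csv("users.csv")     # columns: user_id, username, followers
tweets = pd.read_csv("tweets.csv")   # columns: tweet_id, user_id, text, retweets

# STEP 3-4: group the tweets by user and count them
tweet_counts = (tweets.groupby("user_id")
                      .size()
                      .reset_index(name="tweet_count"))

# STEP 5: join the counts back onto the user table to attach user names
result = users.merge(tweet_counts, on="user_id", how="left").fillna({"tweet_count": 0})

# STEP 6: store the result back out (on a cluster this would go to HDFS)
result.to_csv("tweets_per_user.csv", index=False)
```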
How Twitter used Apache Pig to analyse their large data set?
Properties of Pig?
● Pig processes data in parallel on the Hadoop cluster.
● It provides a language called “Pig Latin” to express data flows.
● Pig Latin contains operators for many of the traditional data operations such as LOAD, STORE, FILTER, FOREACH, GROUP BY, DISTINCT, etc.
● It allows users to develop their own functions (user defined functions) for
reading, processing and writing data.