
I BSC II SEMESTER BIG DATA ANALYTICS

UNIT-I: Introduction to Big Data

What is Big Data:

• Big Data is a field dedicated to the analysis, processing, and storage of large collections of data sets that frequently originate from different sources.

• Big Data solutions and practices are typically required when traditional data analysis, processing and storage technologies and techniques are insufficient.

• Specifically, Big Data addresses distinct requirements, such as the combining of multiple unrelated datasets, processing of large amounts of unstructured data and harvesting of hidden information in a time-sensitive manner.

• The process of capturing/collecting Big Data is known as Datafication.

• Big Data is a pool of huge amounts of data of all types, shapes and formats collected from various sources.

The following are some common types of data and their sources:

Type               | Description                                                                 | Source
Social Data        | Data collected from various social networking sites and online portals     | Facebook, Twitter, Instagram, LinkedIn
Machine Data       | Data generated from RFID chips, sensors and barcode scanners               | RFID chip readings, GPS results
Transactional Data | Data collected from online shopping sites, retailers and B2B transactions  | Amazon, Flipkart, eBay

There exist three fundamental data types/formats that are processed by Big Data solutions. Those are:

1. Structured data
2. Unstructured data
3. Semi-structured data

Apart from the above three data types, Metadata is another important type of data in Big Data environments.

Structured Data:

Structured data conforms to a data model or schema. It is often stored in a tabular form. This makes it easier for any program to sort, read and process the data. It is most often stored in a relational database.

Structured data is frequently generated by enterprise applications and information systems like ERP and CRM systems. Examples of this type of data include banking transactions, invoices, and customer records.

The following symbol can be used to represent structured data:

Figure: Symbol to represent Structured Data.

Unstructured Data:

Data that does not conform to a data model or data schema is known as unstructured data.

• Unstructured data has a faster growth rate than structured data. The following figure shows some common types of unstructured data:

• This form of data is either textual or binary. A text file may contain the contents of various tweets or blog postings.

• Binary files are often media files that contain image, audio or video data.

Special-purpose logic is usually required to process and store unstructured data.

Unstructured data cannot be directly processed or queried using SQL. If it is required to be stored within a relational database, it is stored in a table as a Binary Large Object (BLOB).

Alternatively, a Not-only SQL (NoSQL) database is a non-relational database that can be used to store unstructured data alongside structured data.

Semi-structured Data:

Semi-structured data has a defined level of structure and consistency. Semi-structured data is hierarchical or graph-based.

This kind of data is commonly stored in files that contain text. For example, XML and JSON files are common forms of semi-structured data, as shown in the following figure:
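
For instance, the following hypothetical JSON record (invented purely for illustration) carries its structure in its own keys rather than in a fixed relational schema:

{
  "name": "Ravi",
  "city": "Vijayawada",
  "interests": ["cricket", "movies"]
}

Another record in the same file could add or omit keys, which is why such data is called semi-structured rather than structured.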


Due to the textual nature of this data and its conformance to some level of structure, it is more easily processed than unstructured data.

Elements of Big Data

Five Vs in Big Data

For a dataset to be considered Big Data, it must possess the following five characteristics:

1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value

These five characteristics are commonly known as the Five V's.

Volume

Volume refers to the scale (amount) of data generated each second from social media, smart phones, cars, credit cards, M2M sensors, photographs, video, etc. The amount of data that is to be processed by Big Data solutions is large and ever-growing. Such high data volumes require the use of highly scalable distributed technologies and frameworks that are capable of analysing large volumes of data from different sources.

Velocity

In Big Data environments, velocity refers to the speed at which vast amounts of data are being generated, collected and analysed.

Every day the number of emails, Twitter messages, photos, video clips, etc. increases at lightning speed around the world. Every second of every day, data is increasing. Not only must the data be analysed, but the speed of transmission and access to it are also important for real-time access to websites, credit card verification and instant messaging. Big Data technology now allows us to analyse the data while it is being generated, without ever putting it into databases.

Figure: Examples of high-Velocity Big Data sets.


Variety

Data variety refers to the diversity of the data that needs to be supported. Data variety brings challenges for enterprises in terms of data integration, transformation, processing, and storage. The following figure provides a visual representation of data variety:

Figure: Examples of high-Variety Big Data datasets.

The above figure includes:

• Structured data in the form of financial transactions,
• Unstructured data in the form of images, and
• Semi-structured data in the form of emails.

Veracity

Veracity refers to the quality or correctness of data. Data that enters Big Data environments will be processed in order to resolve invalid data and remove noise.

Data can be part of the signal or noise of a dataset. Noise is data that cannot be converted into information and thus has no value, whereas signals have value and lead to meaningful information.

Data with a high signal-to-noise ratio has more veracity than data with a lower ratio. The signal-to-noise ratio of data is dependent upon the source of the data and its type.

Value

Value is defined as the usefulness of data for an enterprise. Data with higher veracity holds more value. Value is also dependent on how long data processing takes: the longer it takes for data to be turned into meaningful information, the less value it has for a business.

The following figure provides two illustrations of how value is impacted by the veracity of data and by processing time:


Importance of Big Data

• Big Data solutions are ideal for analysing not only raw structured data, but also semi-structured and unstructured data from a wide variety of sources.

• Big Data solutions are ideal when all, or most, of the data needs to be analysed versus a sample of the data, or when a sampling of data isn't nearly as effective as a larger set of data from which to derive analysis.

• Big Data solutions are ideal for iterative and exploratory analysis when business measures on data are not predetermined.

• Big Data is well suited for solving information challenges that don't natively fit within a traditional relational database approach for handling the problem at hand.

Big Data Analytics

Big Data analytics is a discipline that includes the collecting, cleansing, organizing, storing, analysing and governing of large data sets.

In Big Data environments, data analytics has developed methods that allow data analysis to occur through the use of highly scalable distributed technologies and frameworks that are capable of analysing large volumes of data from different sources.

The following symbol can be used to represent data analytics:

There are four general categories of analytics:

1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics

Descriptive Analytics

Descriptive analytics are carried out to answer questions about events that have already occurred. This form of analytics contextualizes data to generate information.

Sample questions can include:

• What was the sales volume over the past 12 months?
• What is the monthly commission earned by each sales agent?

Diagnostic Analytics

Diagnostic analytics aim to determine the cause of a phenomenon that occurred in the past, using questions that focus on the reason behind the event.

Such questions include:

• Why were Q2 sales less than Q1 sales?
• Why was there an increase in patient re-admission rates over the past three months?


Diagnostic analytics provide more value than descriptive analytics but require a more advanced skillset.

Predictive Analytics

Predictive analytics are carried out in an attempt to determine the outcome of an event that might occur in the future.

Questions are usually formulated using a what-if rationale, such as the following:

• What will be the patient survival rate if Drug B is administered instead of Drug A?
• If a customer has purchased Products A and B, what are the chances that they will also purchase Product C?

Prescriptive Analytics

Prescriptive analytics build upon the results of predictive analytics by prescribing actions that should be taken. This kind of analytics can be used to gain an advantage or mitigate a risk.

Sample questions may include:

• Among three drugs, which one provides the best results?
• When is the best time to trade a particular stock?

Big Data Applications

The following are some of the domains that Big Data applications have revolutionized:

Transportation: Big Data has greatly improved transportation services. The data containing traffic information in a city is analysed to identify traffic-jam areas, so that suitable steps can be taken on the basis of this analysis.

Education: Opting for Big Data-powered technology as a learning tool has enhanced the learning of students, as well as aided teachers in tracking their performance better.

Automobile: Rolls-Royce has fitted hundreds of sensors into its engines and propulsion systems to record every tiny detail about their operation. Changes in the data are reported in real time to engineers, who decide the best course of action, such as scheduling maintenance or dispatching engineering teams, based on the Big Data analytics.

Entertainment: Netflix and Amazon use Big Data to make show and movie recommendations to their users.

Insurance: Insurers use Big Data to predict illness and accidents, and to price their products accordingly.

Driver-less Cars: Google's driver-less cars collect about one gigabyte of data per second. These experiments require more and more data for their successful execution.


Government: A very interesting use of Big Data is in the field of politics, to analyse patterns and influence election results. Cambridge Analytica Ltd. is one such organisation, driven entirely by data, that works to change audience behaviour and plays a major role in the electoral process.

Healthcare: Big Data is very useful in the healthcare industry. Important clinical knowledge and a deeper understanding of patient disease patterns can be derived from patients' electronic health records (EHR). This helps to improve patient care and efficiency.

Telecom: Big Data analytics can help Communication Service Providers (CSPs) improve profitability by optimizing network services/usage, enhancing customer experience, and improving security.

Map Reduce

MapReduce is the heart of Apache Hadoop. It is a programming paradigm of Apache Hadoop. MapReduce can process data in parallel in a distributed environment. It enables massive scalability across hundreds or thousands of servers in a Hadoop cluster.

The term "MapReduce" actually refers to two phases:

1. Map
2. Reduce.

• The Map job takes a set of data and converts it into another set of data, in which individual elements are broken down into key/value pairs (tuples).

• The Reducer phase takes place after the mapper phase has been completed. The output of a Mapper or map job (key-value pairs) is the input to the Reducer. The reducer receives key-value pairs from multiple map jobs.

Then, the reducer aggregates those intermediate key-value pairs into a smaller set of key-value pairs as the final output.

The following figure shows the functioning of MapReduce:

Let us understand how MapReduce works by carrying out a word count program manually. Consider a text file called example.txt having the following contents:

Amaravati, Bengaluru, Chennai,
Chennai, Amaravati, Amaravati
Bengaluru, Amaravati, Chennai


Now, we will find the unique words and the number of occurrences of those unique words.

The MapReduce process has the following steps:

Splitting: First, we divide the input into three splits as shown in the following figure. This will distribute the work among all the map nodes.

Amaravati Bengaluru Chennai
Chennai Amaravati Amaravati
Bengaluru Amaravati Chennai

Mapping: We tokenize the words in each mapper and give a hardcoded value (1) to each of the tokens or words. This creates a list of key-value pairs where the key is nothing but the individual word and the value is one.

So, for the first line (Amaravati Bengaluru Chennai) we have 3 key-value pairs: Amaravati, 1; Bengaluru, 1; Chennai, 1. The mapping process remains the same on all the nodes.

Sorting & Shuffling: Partition, sort and shuffle the tuples, so that all the tuples with the same key are sent to the corresponding reducer. After the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that key. For example: Amaravati, [1,1,1,1]; Bengaluru, [1,1]; Chennai, [1,1,1].

Reducing: Count the values present in each key's list of values. For example, the reducer gets the list of values [1,1,1,1] for the key Amaravati; counting the ones in that list gives the final output Amaravati, 4.

Finally, collect all the output key/value pairs and write them to the output file. The following figure depicts the steps in the MapReduce word count process:
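
The same pipeline can be expressed as an actual Hadoop MapReduce program in Java. Below is a minimal sketch modelled on the WordCount example that ships with Hadoop (the class and variable names here are illustrative, not part of the notes above):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapping: emit (word, 1) for every token in the input split,
  // e.g. (Amaravati, 1) for each occurrence of Amaravati.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducing: the framework has already sorted and shuffled the tuples,
  // so each call receives one key and its list of values,
  // e.g. Amaravati, [1,1,1,1] -> (Amaravati, 4).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS, e.g. /alc
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output in HDFS, e.g. /alcresult
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, this class can be run in the same way as the bundled wordcount example used in the next section.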


Scaling the Word Count Program Using MAPREDUCE

A word count program using MapReduce includes the following steps:

1. Selecting the Source File
2. Starting Hadoop DFS and YARN
3. Moving the Source File into HDFS
4. Executing the Source File using MAPREDUCE
5. Displaying the Results

Note: This example uses Apache Hadoop stable version 2.9.0.

1. Selecting the Source File:

You can select an existing file or create a new one.

1.1 Create a text file, input some words and save it.

Example:

gedit alc.txt

Type the following text, for example:

Andhra Loyola College is managed and administered by the members of the Society of Jesus (Jesuits), a Catholic religious order, which has rendered signal service in the fields of education and service to humanity for over 450 years. The college was founded in December 1953 at the request of the Catholic bishops of Andhra Pradesh and began its academic sessions in July 1954. The college offers Intermediate (+2), Degree, and Postgraduate courses as well as conducts research programmes in collaboration with several reputable universities.

2. Starting Hadoop DFS and YARN

Starting the Hadoop DFS and YARN has the following 3 steps:

2.1 Switch to the user hadoopusr:

su hadoopusr

2.2 Start the Hadoop DFS:

start-dfs.sh

2.3 Start the Hadoop YARN:

start-yarn.sh

3. Moving the Source File into HDFS

Now, we need to move the input file into HDFS first; only then can the map-reduce wordcount example execute, because MAPREDUCE takes its input from HDFS and sends its output to HDFS.

3.1 Give the following command to move the input file to HDFS:

hadoop dfs -put /home/ravi/Desktop/alc /

3.2 Verify whether the file is successfully placed in HDFS:

hadoop dfs -cat /alc

4. Executing the Source File using MAPREDUCE

Executing the word count program has the following steps:

4.1 Change the current directory to the MAPREDUCE folder:

cd /usr/local/hadoop/share/hadoop/mapreduce

Now, we can execute the wordcount example using the hadoop jar command. Mention the name of the jar (hadoop-mapreduce-examples-2.9.0.jar),


mention the class (wordcount), give the input file which you placed in HDFS, and also specify the name of the output file in which you want your result displayed.

4.2 Executing the file:

hadoop jar hadoop-mapreduce-examples-2.9.0.jar wordcount /alc /alcresult

5. Displaying the Results

Displaying the results from the output file has the following steps:

5.1 We can check the output using the following command:

hadoop dfs -ls /alcresult

It will list two files:

• a _SUCCESS file
• a part file.

For example, as shown below:

Found 2 items
-rw-r--r-- 1 hadoopusr supergroup 0 2018-12-02 15:21 /alcresult/_SUCCESS
-rw-r--r-- 1 hadoopusr supergroup 582 2018-12-02 15:21 /alcresult/part-r-00000

*The output/result is available in the part file.

5.2 Read the part file using the cat command:

hadoop dfs -cat /alcresult/part-r-00000

*We can also check the result in a GUI, using a browser window connected to localhost on port 50070.

A BRIEF HISTORY OF HADOOP

• 2004—Initial versions of what are now the Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
• December 2005—Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
• January 2006—Doug Cutting joins Yahoo!.
• February 2006—Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
• February 2006—Adoption of Hadoop by the Yahoo! Grid team.
• April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006—Yahoo! set up a Hadoop research cluster—300 nodes.
• May 2006—Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
• October 2006—Research cluster reaches 600 nodes.
• December 2006—Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007—Research cluster reaches 900 nodes.
• April 2007—Research clusters—2 clusters of 1000 nodes.
• April 2008—Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
• October 2008—Loading 10 terabytes of data per day onto research clusters.
• March 2009—17 clusters with a total of 24,000 nodes.
• April 2009—Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes).


Hadoop Eco System

Hadoop Distributed File System (HDFS)

• It is the most important component of the Hadoop Ecosystem.
• HDFS is the primary storage system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data.
• HDFS is a distributed filesystem that runs on commodity hardware.
• HDFS comes with a default configuration that suits many installations.
• Hadoop interacts directly with HDFS through commands.

Flume

• Flume efficiently collects, aggregates and moves large amounts of data from their origin and sends them back to HDFS.
• It is a fault-tolerant and reliable mechanism. This Hadoop Ecosystem component allows the data flow from the source into the Hadoop environment.
• It uses a simple extensible data model that allows for online analytic applications.
• Using Flume, we can get the data from multiple servers into Hadoop immediately.

Sqoop

• Sqoop imports data from external sources into related Hadoop ecosystem components like HDFS, HBase or Hive.
• It also exports data from Hadoop to other external sources.
• Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.

YARN

• Hadoop YARN (Yet Another Resource Negotiator) is a Hadoop ecosystem component that provides resource management.
• YARN is also one of the most important components of the Hadoop Ecosystem.
• YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads.
• It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.

Hive

• The Hadoop ecosystem component Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
• Hive performs three main functions: data summarization, query, and analysis.
• Hive uses a language called HiveQL (HQL), which is similar to SQL.
• HiveQL automatically translates SQL-like queries into MapReduce jobs which execute on Hadoop.


Mahout

• Mahout is an open-source framework for creating scalable machine learning algorithms, and a data mining library.
• Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.

Pig

• Apache Pig is a high-level language platform for analysing and querying huge datasets that are stored in HDFS.
• Pig, as a component of the Hadoop Ecosystem, uses the Pig Latin language, which is very similar to SQL.
• It loads the data, applies the required filters and dumps the data in the required format.
• For program execution, Pig requires a Java runtime environment.

HBase

• Apache HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that can have billions of rows and millions of columns.
• HBase is a scalable, distributed NoSQL database that is built on top of HDFS.
• HBase provides real-time access to read or write data in HDFS.

Ambari

• Ambari, another Hadoop ecosystem component, is a management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.
• Hadoop management gets simpler as Ambari provides a consistent, secure platform for operational control.

Zookeeper

• Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
• Zookeeper manages and coordinates a large cluster of machines.

Oozie

• Oozie is a workflow scheduler system for managing Apache Hadoop jobs.
• Oozie combines multiple jobs sequentially into one logical unit of work.
• The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its architectural centre, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.

R Connectors

• Interfaces to work with Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables.
• Predictive analytic techniques, written in R or Java as Hadoop MapReduce jobs, can be applied to data in HDFS files.
