Big Data Unit1
Apart from the above three data types, Metadata is another important type of data in a Big Data environment.

Structured Data:
Structured data conforms to a data model or schema and is often stored in tabular form. This makes it easier for a program to sort, read and process the data. It is most often stored in a relational database.
Structured data is frequently generated by enterprise applications and information systems such as ERP and CRM systems. Examples of this type of data include banking transactions, invoices, and customer records.
The following symbol can be used to represent structured data:
Figure: Symbol to represent Structured Data.

Unstructured Data:
Data that does not conform to a data model or data schema is known as unstructured data.
• This form of data is either textual or binary. A text file may contain the contents of various tweets or blog postings, while a binary file may contain image, audio or video data.
• Special-purpose logic is usually required to process and store unstructured data.
• Unstructured data cannot be directly processed or queried using SQL. If it is required to be stored within a relational database, it is stored in a table as a Binary Large Object (BLOB).
• Alternatively, a Not-only SQL (NoSQL) database is a non-relational database that can be used to store unstructured data alongside structured data.
• Unstructured data has a faster growth rate than structured data. The following figure shows some common types of unstructured data:

Semi-structured Data:
Semi-structured data has a defined level of structure and consistency. Semi-structured data is hierarchical or graph-based.
This kind of data is commonly stored in files that contain text. For example, XML and JSON files are common forms of semi-structured data, as shown in the following figure:
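For instance, a small JSON fragment (an illustrative example, not taken from the original figure) shows the self-describing, hierarchical structure typical of semi-structured data:

{
  "user": "u101",
  "post": "Learning Hadoop",
  "tags": ["bigdata", "hadoop"],
  "replies": [ { "user": "u102", "text": "Nice!" } ]
}

Field names and nesting can vary from record to record, which is why such data does not fit a fixed relational schema.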
Figure: Characteristics of Big Data (Volume, Veracity).
Government: A very interesting use of Big Data is in the field of politics, to analyse patterns and influence election results. Cambridge Analytica Ltd. is one such organisation, driven entirely by data to change audience behaviour, and it played a major role in the electoral process.

Healthcare: Big Data is very useful in the healthcare industry. Important clinical patterns of patient disease can be studied from patients' electronic health records (EHR). This helps to improve patient care and efficiency.

Telecom: Big Data analytics can help Communication Service Providers (CSPs) improve profitability by optimizing network services/usage, enhancing customer experience, and improving security.

Map Reduce
• The Map job takes a set of data and converts it into another set of data, breaking the individual elements down into key/value pairs (tuples).
• The Reducer phase takes place after the Mapper phase has been completed. The output of a Mapper or map job (key-value pairs) is input to the Reducer. The Reducer receives the key-value pairs from multiple map jobs. Then, the Reducer aggregates those intermediate key-value pairs into a smaller set of key-value pairs as the final output.
The following figure shows the functioning of MapReduce:
Now, we will find the unique words and the number of occurrences of those unique words. The MapReduce process has the following steps:
Splitting: First, we divide the input into three splits, as shown in the following figure. This distributes the work among all the map nodes.
Mapping: Each map node tokenizes its split and emits a key/value pair of the form (word, 1) for every word.
Shuffling: The pairs are then sorted and grouped by key, so that all the tuples with the same key are sent to the corresponding reducer.
Reducing: Count the values present in the list of values for each key. For example, get the list of values [1, 1] for the key Amaravati; then count the number of ones in that list and give the final output as Amaravati, 2.
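The wordcount class used in the steps below is, in essence, the classic Hadoop WordCount job. A minimal sketch in Java is shown here for reference (based on the standard Hadoop MapReduce API; the version bundled in the examples jar may differ in detail):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapping: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // e.g. (Amaravati, 1)
      }
    }
  }

  // Reducing: sum the list of ones received for each key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // e.g. (Amaravati, 2)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The Combiner is an optional local pre-aggregation step that reduces the amount of data shuffled from the mappers to the reducers.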
As a running example, the input text placed in HDFS describes Andhra Loyola College:

Andhra Loyola College is managed and administered by the members of the Society of Jesus (Jesuits), a Catholic religious order, which has rendered signal service in the fields of education and service to humanity for over 450 years. The college was founded in December 1953 at the request of the Catholic bishops of Andhra Pradesh and began its
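The earlier steps of these notes (saving this text to a local file and copying it into HDFS) are not shown above. Assuming the text is saved locally as alc.txt (a hypothetical file name), it could be loaded into HDFS as follows:

hadoop dfs -put alc.txt /alc

This copies the local file into HDFS as /alc, the input path used in step 4.2 below.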
4.1 Change the current directory to the MAPREDUCE folder:

cd /usr/local/hadoop/share/hadoop/mapreduce

Now, we can execute the wordcount example using hadoop jar:
• mention the name of the jar: hadoop-mapreduce-examples-2.9.0.jar
• mention the class: wordcount
• give the input file which you placed in HDFS
• and also specify the name of the output file in which you want to display your result.
For example:

4.2 Executing the File

hadoop jar hadoop-mapreduce-examples-2.9.0.jar wordcount /alc /alcresult
5. Displaying the Results
Displaying the results from the output file has the following steps:

5.1 We can check the output using the following command:

hadoop dfs -ls /alcresult

It will list two files:
• a _SUCCESS file
• a PART file.
For example, as shown below:

Found 2 items
-rw-r--r--   1 hadoopusr supergroup     0 2018-12-02 15:21 /alcresult/_SUCCESS
-rw-r--r--   1 hadoopusr supergroup   582 2018-12-02 15:21 /alcresult/part-r-00000

*The output/result is available in the PART file.

5.2 Read the part file using the cat command:

hadoop dfs -cat /alcresult/part-r-00000
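The part file contains one line per unique word, with the word and its count separated by a tab. Purely for illustration (these are not the actual counts for the input above), the output has the form:

Andhra	1
College	1
the	4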
* We can also check the result in a GUI, using a browser window connected to localhost on port 50070 (the NameNode web UI).

A BRIEF HISTORY OF HADOOP
• 2004: Initial versions of what is now the Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
• December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
• January 2006: Doug Cutting joins Yahoo!.
• February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
• February 2006: Adoption of Hadoop by the Yahoo! Grid team.
• April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006: Yahoo! set up a Hadoop research cluster with 300 nodes.
• May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
• October 2006: Research cluster reaches 600 nodes.
• December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007: Research cluster reaches 900 nodes.
• April 2007: Research clusters: 2 clusters of 1,000 nodes.
• April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
• October 2008: Loading 10 terabytes of data per day onto research clusters.
• March 2009: 17 clusters with a total of 24,000 nodes.
• April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes).
YARN
• Hadoop YARN (Yet Another Resource Negotiator) is a Hadoop Ecosystem component that provides resource management.
• YARN is also one of the most important components of the Hadoop Ecosystem.
• YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads.
• It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.

Hadoop Distributed File System (HDFS)
• It is the most important component of the Hadoop Ecosystem.
• HDFS is the primary storage system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data.
• HDFS is a distributed filesystem that runs on commodity hardware.
• HDFS comes with a default configuration that suits many installations.
• We can interact directly with HDFS using shell-like commands.
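Besides shell commands, HDFS can also be accessed programmatically through its Java API. A minimal sketch (assuming a running cluster whose configuration is on the classpath; /alc is the input file used earlier in these notes):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    // Obtain the default filesystem (HDFS, per core-site.xml).
    FileSystem fs = FileSystem.get(new Configuration());
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/alc"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);   // print each line of the HDFS file
      }
    }
  }
}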
Hive
• The Hadoop Ecosystem component Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
• Hive performs three main functions: data summarization, query, and analysis.
• Hive uses a language called HiveQL (HQL), which is similar to SQL.
• Hive automatically translates HiveQL's SQL-like queries into MapReduce jobs which execute on Hadoop.
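As a brief illustration (the table and columns here are hypothetical, not from these notes), HiveQL reads much like SQL while Hive compiles the query into MapReduce jobs behind the scenes:

-- Define a hypothetical table over data stored in Hadoop.
CREATE TABLE pageviews (userid STRING, page STRING, ts STRING);

-- Count views per page; Hive translates this into MapReduce job(s).
SELECT page, COUNT(*) AS views
FROM pageviews
GROUP BY page;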
Flume
• Flume efficiently collects, aggregates and moves large amounts of data from their origin into HDFS.
• It is a fault-tolerant and reliable mechanism. This Hadoop Ecosystem component allows data to flow from the source into the Hadoop environment.
• It uses a simple extensible data model that allows for online analytic applications.

Mahout
• Mahout is an open-source framework for creating scalable machine learning algorithms and a data mining library.
• Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.

Zookeeper
• Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
• Zookeeper manages and coordinates a large cluster of machines.
Pig
• Apache Pig is a high-level platform for analyzing large datasets on Hadoop; its language, Pig Latin, is compiled into MapReduce jobs.

Oozie
• Apache Oozie is a workflow scheduler system for managing and chaining Hadoop jobs.
HBase
• Apache HBase is a Hadoop Ecosystem component which is a distributed database that was designed to store structured data in tables that could have billions of rows and millions of columns.
• HBase is a scalable, distributed NoSQL database that is built on top of HDFS.
• HBase provides real-time access to read or write data in HDFS.
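A minimal sketch of the HBase Java client API (the table name, column family and values here are hypothetical; it assumes an existing table named students with column family info):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Connect using the cluster configuration on the classpath.
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("students"))) {
      // Write one cell: row key, column family, qualifier, value.
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
          Bytes.toBytes("Asha"));
      table.put(put);
      // Read the same cell back in real time.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}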
R Connectors
• Interfaces to work with Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables.
• Predictive analytic techniques, written in R or Java as Hadoop MapReduce jobs, can be applied to data in HDFS files.
Ambari
• Ambari, another Hadoop Ecosystem component, is a management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.