Shivani Complete Book
For Engineering Students
As Per New Scheme & Syllabus (AICTE Flexible Curriculum)
UNIT-I — Introduction to big data, Big data characteristics, Types of big data, Traditional versus big data, Evolution of big data, Challenges with Big Data, Technologies available for Big Data, Infrastructure for Big Data, Use of Data Analytics, Desired properties of Big Data system.

UNIT-II — Introduction to Hadoop, Core Hadoop components, Hadoop Ecosystem, Hive Physical Architecture, Hadoop limitations, RDBMS versus Hadoop, Hadoop Distributed File System, Processing Data with Hadoop, Managing Resources and Applications with Hadoop YARN, MapReduce programming.

UNIT-III — Introduction to Hive, Hive Architecture, Hive Data types, Hive Query Language, Introduction to Pig, Anatomy of Pig, Pig on Hadoop, Use Case for Pig, ETL Processing, Data types in Pig, Running Pig, Execution model of Pig, Operators, Functions, Data types of Pig.

UNIT-IV — Introduction to NoSQL, NoSQL Business Drivers, NoSQL Data architectural patterns, Variations of NoSQL architectural patterns, Using NoSQL to Manage Big Data, Introduction to MongoDB.

UNIT-V — Mining Social Network Graphs — Introduction, Applications of social network mining, Social Networks as a Graph, Types of social networks, Clustering of social graphs, Direct discovery of communities in a social graph, Introduction to recommender system.
Price : Rs. 90.00 (Rs. Ninety Only)
Edition : 2020
UNIT-I
INTRODUCTION TO BIG DATA

INTRODUCTION TO BIG DATA, BIG DATA CHARACTERISTICS, TYPES OF BIG DATA, TRADITIONAL VERSUS BIG DATA, EVOLUTION OF BIG DATA, CHALLENGES WITH BIG DATA
Q.1. What is big data ? Explain.
Ans. "Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges" is termed as big data.

The technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently are referred to as big data. The data sets that are so large, complex, and impractical to manage with traditional software tools are described by big data.

But now the information from big data can be analyzed by using new technologies, e.g., user web clicks can be tracked by retailers to identify behavioural trends that improve campaigns, pricing and stockage.

Major web companies such as Google, Amazon, and Facebook pioneered businesses built on monetizing massive data volumes over the last decade. The new paradigms, not only for extracting value from data but also for managing data and compute resources, from data center design to hardware to software to application provisioning, were invented by them.

Another definition of big data is as follows —

"The collection, processing, discovery, analysis and storage of large volumes and disparate types of data is enabled by the emerging technologies and practices, very quickly and cost effectively."
Q.2. What is the importance of big data ?
Ans. The importance of big data lies in how it is used. Data can be fetched from any source and analyzed to find answers that enable an organization in terms of —
(i) Cost reductions and time reductions.
(ii) Finding the root cause of failures, issues and defects in near real-time.
(iii) Generating coupons at the point of sale based on the customer's habit of buying goods.
(iv) Recalculating entire risk portfolios in just minutes.
(v) Detecting fraudulent behaviour before it affects and risks our organization.

Q.3. Write short note on drivers for big data.
Ans. There are three contributing factors or drivers for big data. These drivers are consumers, automation and monetization. More than each of these contributing factors alone, it is their combination that accelerates the creation of big data. With increasing automation, production and consumption opportunities become scalable, connecting sophisticated consumers with an efficient marketplace for big data. These drivers are explained below —

(i) Sophisticated Consumers — The increase in information level and the associated tools has created a new breed of sophisticated consumers. These consumers are far more analytic, far savvier at using statistics, and far more connected, using social media to rapidly collect and collate opinion from others.

(ii) Automation — Marketing and sales have received their biggest boost in instrumentation from Internet-driven automation over the past 10 years. Browsing, shopping, ordering, and customer service on the web not only serve the user but have also created an enormous stream of data about products, sales and the buyer's behavior. Each sequence of web clicks can be collected, collated and analyzed for customer delight, puzzlement, dysphoria, or outright defection. More information can also be obtained about the sequences leading up to a decision.

(iii) Monetization — From a big data analytics perspective, monetization is the biggest enabler, creating an external marketplace where companies exchange and sell customer information. We are seeing a new trend in the marketplace, in which customer experience from one industry is packaged and sold to other industries.

Q.4. Discuss four V and five V's characteristics of big data with suitable diagram.
Ans. Big data is a blanket term for data sets so large that an organization must manage and maintain vast amounts of data at the right speed and at the right time to gain the right insights. Big data has been defined based on some of its characteristics; these characteristics, earlier known as the four Vs, have later been extended to five Vs.

Fig. 1.1 Five V's Big Data Characteristics

(i) Volume — It refers to the quantity of data gathered by a company. This data must be used further to gain important knowledge. Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information (e.g., turning 12 terabytes of tweets per day into improved product sentiment analysis, or converting 350 billion annual meter readings to better predict power consumption).

(ii) Velocity — It refers to the speed at which data is generated and must be processed; big data has to be handled at the right speed and at the right time to gain the right insights.

(iii) Variety — It refers to the type of data that big data can comprise. It may be structured or unstructured. Big data consists of different types of data, including structured and unstructured data such as text, sensor data, log files and so on. Combined analysis of these types supports new situations, such as monitoring of live video feeds from surveillance cameras to target points of interest, or exploiting the 80% data growth in images, video and documents to improve customer satisfaction.

(iv) Value — It refers to the important feature of the data which is defined by the added value that the collected data can bring to the intended process, activity or predictive analysis/hypothesis. Data value depends on the events or processes the data represent, such as stochastic, probabilistic, regular or random processes. Depending on this, requirements may arise to collect all data or to store it for a longer period; handling new data types brings new problems related to the data volume and variety.

(v) Veracity — It refers to the degree in which a leader trusts information in order to make a decision. Trust in the data is very important for the business future. However, as many business leaders do not trust the information they use, generating trust in big data presents a huge challenge as the number and type of sources grows.
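Three of the V's above can be made concrete on a toy event stream. The sketch below uses made-up `(timestamp, payload)` records purely for illustration: volume as total payload size, velocity as events per second, variety as the number of distinct record types.

```python
# Toy event stream: (timestamp_seconds, payload) pairs, invented for illustration.
events = [(0.0, "click"), (0.5, "click"), (1.0, "view"), (1.5, "click"), (2.0, "view")]

volume = sum(len(payload) for _, payload in events)   # total payload characters
duration = events[-1][0] - events[0][0]               # seconds covered by the stream
velocity = len(events) / duration                     # events arriving per second
variety = len({payload for _, payload in events})     # number of distinct event types

print(volume, velocity, variety)                      # 23 2.5 2
```

At real scale the same three measurements are taken over terabytes of records per day rather than five tuples, but the definitions are identical.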
Q.5. Explain big data types with examples.
Or
Write short note on structured and unstructured data.

Ans. Big data encompasses everything, from dollar transactions to tweets to images to audio. Therefore, taking advantage of big data requires all this information to be integrated for analysis and data management, which is more difficult than it appears. Big data includes huge volumes of the following three types of data —

(i) Structured Data — Data that is stored in a relational database as tables, in the format of rows and columns. Structure is imposed on the data by creating a model; the model allows storing, processing and accessing the data, as well as controlling permissions over it. Structured query language (SQL) is used for managing such data.
Example — Relational data.

(ii) Semi-structured Data — Data which is not in the form of rows and columns and does not fit the relational data model, but still carries some organizational properties. This form of data increased rapidly after the introduction of the Web, where various forms of data need a medium for interchange, like XML and JSON.
Example — CSV, XML and JSON documents are semi-structured. NoSQL databases are considered as semi-structured.

(iii) Unstructured Data — Data without any specific structure, which cannot be stored in the row-and-column format of a databank. The volume of such data is growing rapidly, and it is tough to manage and analyze.
Example — Text, images, audio and video.

Fig. 1.2 Big Data Types
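The difference between structured and semi-structured data can be seen with Python's standard library. The sketch below uses invented records: the CSV rows all share one fixed set of columns, while the JSON documents are self-describing and one of them carries an extra field the other lacks.

```python
import csv
import io
import json

# Structured data: a fixed schema, every row has the same columns.
csv_text = "id,name,amount\n1,Alice,250\n2,Bob,175\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured data: self-describing fields that may vary per record.
json_text = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob", "tags": ["vip"]}]'
docs = json.loads(json_text)

print(rows[0]["name"])       # Alice
print(docs[1].get("tags"))   # ['vip'] -- only the second document has this field
```

Unstructured data (free text, images, audio) has no such per-field access at all; it must be processed with parsing, search or machine-learning techniques before it yields fields like these.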
Q.6. Give advantages of big data over traditional data.
Ans. Traditional databases were the major source for storing and managing data until about 30-40 years back, and they deal mainly with structured data under a fixed schema. Big data helps find solutions at a lower price and improves performance. Its distributed architecture is based on microprocessors, which is economical as compared to a centralized database based on mainframes, and the distributed architecture has more computational power as compared to the traditional one. Traditional database systems are based on structured data, whereas big data uses semi-structured as well as unstructured data. Traditional databases store small amounts of data, ranging from some gigabytes to terabytes; big data, however, can store and analyze data ranging from hundreds of terabytes or petabytes and more. Storing large amounts of data at reduced cost helps business intelligence (BI). Traditional database systems require complex and expensive software and hardware for managing large amounts of data, while in big data the large data is divided into several systems, so the amount of data in each system is reduced. This makes the use of big data simple and cheap.

Q.7. Compare traditional data and big data.
Ans. The comparison of traditional data and big data is given in table 1.1.

Table 1.1 Comparison of Traditional Data and Big Data

Basis | Traditional Data | Big Data | Advantage of Big Data
Data architecture | Centralized database | Distributed database | Cost effective
Data schema | Fixed schema | Dynamic schema | Preserves the information in data
Volume | Small amount of data, range in gigabytes to terabytes | Large amount of data, range in petabytes and more | Improves ...
Accuracy | Less accurate results | Highly accurate results | Confident and reliable results

Q.8. Describe history of big data.
Ans. Big data is a long evolution of capturing and using data, and not a new phenomenon. Big data is the future act that will bring change in the way we run society, just like the other developments in storage of data, processing of data and the internet. The ancient history of data begins when humans used tally sticks for storing and analysis of data, about C 18,000 BCE. The tribal peoples used to mark notches into bones or sticks for calculations, which would help them predict how long their food would last. One of the earliest prehistoric data storage devices is the Ishango Bone, discovered in 1960 in what is now Uganda. Then, in C 2400 BCE, came the very first device particularly for performing calculations, the abacus. For the 1880 census, Herman Hollerith used punch cards for data analysis, reducing years of work to 3 months, and designed automated computation. Then came Business Intelligence and the start of large data centers, where relational databases and Material Requirement Planning systems were deployed. The first use of the term big data was made by Erik Larson in 1989, where he said that "The keepers of big data say they do it for the consumer's benefit. But data have a way of being used for purposes other than originally intended."
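The fixed-versus-dynamic schema row of table 1.1 can be demonstrated in a few lines. This is a toy sketch, not a big data store: an in-memory SQLite table stands in for the traditional fixed schema, and a list of plain dictionaries stands in for a dynamic-schema record store; the table and field names are invented.

```python
import sqlite3

# Fixed schema (traditional): columns are declared up front; a record
# with an unexpected extra attribute is rejected outright.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
con.execute("INSERT INTO customer VALUES (1, 'Alice')")
try:
    con.execute("INSERT INTO customer VALUES (2, 'Bob', 'extra')")
except sqlite3.OperationalError as err:
    print("rejected:", err)

# Dynamic schema (big data stores): each record carries its own fields,
# so new attributes appear without altering any table definition.
records = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob", "segment": "retail"},  # extra field is fine
]
print(len(records))
```

The dynamic side "preserves the information in data" in exactly the sense of the table: nothing has to be dropped or remodelled before it can be stored.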
Big data tools for batch and stream processing include the following —

Tool | Use | Strengths
(i) S4 | Processing continuous unbounded streams of data | Fault-tolerant, pluggable platform
(ii) Storm | Distributed real-time computation infrastructure and platform | High-performance distributed execution engine, good programmability
(iii) Apache Mahout | Machine learning algorithms in business | Good maturity
(iv) Jaspersoft BI suite | Business intelligence software | Cost-effective, self-service BI at scale
(v) Pentaho | Business analytics platform | Robustness, scalability, flexibility in knowledge discovery
(vi) Skytree server | Machine learning and advanced analytics | Processes massive datasets accurately at high speeds
(vii) Tableau | Data visualization, business analytics | Faster, smart, beautiful and easy-to-use dashboards
(viii) Karmasphere | Big data workspace | Collaborative and standards-based unconstrained analytics
(ix) Talend Open Studio | Open-source data management | ...
(x) SQLstream s-Server | Sensor, M2M and telematics applications | SQL-based, real-time streaming big data platform
(xi) Splunk | Collect and harness machine data | Fast and easy to use; dynamic environments; scales from laptop to datacenter
(xii) Apache Kafka | Distributed publish-subscribe messaging system | High-throughput stream of immutable activity data
(xiii) SAP Hana | Platform for real-time analytics | Fast in-memory computing

Hadoop processes large amounts of data in parallel and provides a general mechanism to distribute aggregate workloads across different machines; however, it is designed for batch processing and is not a real-time, high-performance engine, as it has high throughput latency in its implementations. Stream-processing tools fill this gap.

(iii) Big Data Tools Based on Interactive Analysis — Interactive analysis presents the data in an interactive environment, allowing users to undertake their own analysis of information. Users are directly connected to the computer and hence can interact with it in real time. The data can be viewed, compared and analyzed in tabular or graph format, or both at the same time.
(a) In 2010, Google proposed an interactive analysis system named Dremel, which is scalable for processing nested data. Dremel has a very different architecture compared to the well-known Apache Hadoop and acts as a successful complement to it. It has the capability to run aggregation queries over trillion-row tables in seconds by combining multi-level execution trees and columnar data layout.
(b) Apache Drill is similar to Google's Dremel. For Drill, there is more flexibility to support various different query languages, data formats and data sources.

Q.14. Write short note on big data handling.
Ans. For big data handling, the following approaches have been developed. NoSQL stores are utilized to store and query data in place of a traditional relational database management system, and Hadoop is used for storage and distributed processing; this approach leads to faster processing. Before processing, big information must be recorded from its many generating sources in the order of its happening, whether anticipated or unpredicted, and the infrastructure for big data has to be different from that for traditional data, with matching analytic tools.
Q.15. Describe the architecture for big data.
Ans. The architecture for big data is shown in fig. 1.5. Its layers, from top to bottom, are: Big Data Applications; Reporting and Visualization; Analytics (Traditional and Advanced); Analytical Data Warehouses and Data Marts; Organizing Data Services and Tools; Operational Data Sources; Security Infrastructure; Redundant Physical Infrastructure; and Interfaces and Feeds from/to the Internet.

Fig. 1.5 Big Data Architecture

(i) Redundant Physical Infrastructure — Redundancy is important to handle a great deal of data from many sources, and it comes in many forms. For instance, if the company has created a private cloud, it may want to create redundancy within private areas so that it can scale out to support changing workloads. If a company needs to limit internal IT growth, it may use external cloud services to augment its own resources. In some cases, this redundancy may come in the form of Software as a Service (SaaS), allowing companies to carry out advanced data analysis without owning the infrastructure.

(ii) Security Infrastructure — Data about individuals, for example healthcare data about demographics or shifts in patient needs, must be protected both to meet compliance requirements and to protect patient privacy. Organizations need to specify who is allowed to see the data and when, to be able to verify the identity of users, and to protect the identity of patients. These security requirements must be part of the big data fabric from the outset, and not an afterthought.

(iii) Operational Data Sources — Concerning big data, ...

(iv) Interfaces and Feeds — What makes big data big is that it relies on picking up lots of data from lots of sources. Therefore, open application programming interfaces (APIs) are a core part of any big data architecture, at every level and between every layer. The architecture also must work with the supporting infrastructure of the organization or company.
For instance, a company might be interested in running models to determine whether it is safe to drill for oil in an offshore area, given real-time data of temperature, salinity, sediment resuspension, and many other biological, chemical, and physical properties of the water column. It might take days to run this model using a traditional server configuration; using a distributed computing model, however, a days-long task may take minutes.

Performance might also determine the kind of database that a company would use. In certain circumstances, stakeholders may want to understand how two distinct data elements are related, or the relationship between social network activity and growth in sales. This is not the typical query one would ask of a structured, relational database. A graphical database might be a better choice, as it may be tailored to separate the "node" or entity from its "properties" (the information that defines that entity) and the "edge" (the relationship between nodes and properties). Using the right database may also improve performance. Typically, a graph database may be used in scientific and technical applications.

(v) Organizing Data Services and Tools — Indeed, not all the data that organizations use is operational. A growing amount of data comes from sources that are not quite as organized or straightforward, including data that comes from machines or sensors, and massive public and private data sources. In the past, most companies were not able to either capture or store this vast amount of data. It was simply too expensive or too overwhelming. Even if companies were able to capture the data, they did not have the tools to do anything about it. Very few tools could make sense of these vast amounts of data, and the tools that did exist were complex to use and did not produce results within a reasonable time frame. In the end, companies that really wanted to make the enormous effort of analyzing this data were forced to work with snapshots of data. This means that stakeholders may miss out on relevant events, as they may not have been captured in a certain snapshot.

(vi) Analytical Data Warehouses and Data Marts — After a company sorts through the massive amounts of data available, it is often important to take the subset of data that reveals patterns and put it into a form available to the business. Companies have always relied on the capability to create reports to give them an understanding of what the data tells them about everything from monthly sales figures to projections of growth. Big data changes the way that data is managed and used. If a company can collect, manage and analyze enough data, it can use new tools to help management truly understand the impact, not just of metrics, but of how data elements offer context based on how the data is related.

(vii) Reporting and Visualization — With big data, reporting and data visualization become tools for understanding the context of how data is related and the impact of those relationships on the future.

(viii) Big Data Applications — Traditionally, the business has expected that data would be used to answer questions about what to do and when to do it. Big data applications take advantage of the unique characteristics of big data: a big data application might be able to monitor premature infants to determine when data indicates that intervention is needed; in manufacturing, a big data application can be used to prevent a machine from shutting down during a production run; a big data traffic management application may reduce the number of traffic jams on busy city highways, decreasing the number of accidents while saving fuel and reducing pollution.

Q.16. What do you mean by big data analytics ? Explain various types of analytics.
Ans. Big data analytics is the process of examining large data sets containing a variety of data types, i.e., big data, to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. The analytical findings can lead to more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations and other business benefits.

The primary goal of big data analytics is to help companies make more informative business decisions by enabling data scientists, predictive modellers and other analytics professionals to analyse large volumes of transactional data, as well as other forms of data that may be untapped by more conventional business intelligence (BI) programs. That could include web server logs and Internet clickstream data, social media content and social network activity reports, text from customer e-mails and survey responses, mobile phone call detail records, and machine data captured by sensors connected to the Internet of Things.

Big data burst upon the scene in the first decade of the 21st century, and the first organizations to embrace it were online and startup firms. Arguably, firms like Google, LinkedIn, eBay and Facebook were built around big data from the beginning. They did not have to reconcile or integrate big data with more traditional sources of data and the analytics performed upon them, because they did not have those traditional forms. They did not have to merge big data technologies with their traditional IT infrastructures, because those infrastructures did not exist. Big data could stand alone, big data analytics could be the only focus of analytics, and big data technology architectures could be the only architecture.
Analytics can be classified into following three types —
(i) Predictive analytics,
(ii) Descriptive analytics,
(iii) Prescriptive analytics.

(i) Predictive Analytics — Predictive analysis establishes patterns and gives a list of solutions which may come up for a given situation. Predictive analysis studies the present as well as past data and gives probabilities about what may happen in the future, using the available big data to forecast data which we do not yet have. This analytical method is one of the most commonly used methods for sales lead scoring, social media and consumer relationship management data. Three basic elements of predictive analytics are as follows —
(a) Predictive modeling,
(b) Decision analysis and optimization,
(c) Transaction profiling.

(ii) Descriptive Analytics — Descriptive analytics, also known as data mining, describes what is happening in real time. It is one of the simplest forms of analytics, as it converts big data into smaller, useful bytes of information.

Q.17. Explain core components of analytical data architecture. [R.G.P.V., May 2019 (VIII-Sem.)]
Ans. The big data storage and analytics platform provides resources and services for storage as well as for batch and real-time processing of the data. It provides the main integration interfaces between the site operational platform and the cloud data lab platform, and the programming interfaces for the implementation of the data processes. The internal structure of the big data storage and analytics platform is given in fig. 1.6.

Fig. 1.6 The Internal Architecture of the Big Data Storage and Analytics Platform

Data is stored in the distributed file system, which is responsible for partitioning and replication of large datasets across the multiple servers. Access to the structured data is provided by the distributed data store through a standard SQL interface. The main component responsible for processing is the distributed data processing framework, which provides a high-level API for the implementation of the data pre-processing tasks and for the execution of the predictive functions. Predictive functions are aimed at increasing revenue, reducing OpEx, reducing churn, and other key business objectives.
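A predictive function of the kind described above can be as simple as fitting a trend line to past observations and projecting it forward. The sketch below is a minimal, pure-Python least-squares fit over an invented sales history; real predictive analytics would use far richer models and real data.

```python
# Hypothetical sales history for periods 0..4 (made-up numbers).
sales = [10.0, 12.0, 13.0, 15.0, 16.0]
n = len(sales)
xs = range(n)

# Closed-form least-squares fit of a straight line y = intercept + slope * x.
mean_x = sum(xs) / n
mean_y = sum(sales) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sales))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Forecast the next (unobserved) period -- the "data which we do not yet have".
forecast = intercept + slope * n
print(round(forecast, 2))   # 17.7
```

The same shape (fit on past data, score on future inputs) underlies the sales lead scoring and CRM uses mentioned above, just with more variables and more sophisticated estimators.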
Operators face an uphill challenge when they need to deliver compelling, revenue-generating services without overloading their networks, and use advanced analytics to decide where they are investing now and where to invest in the next few years.

Q.19. Explain advantages of using big data analytics in healthcare sector and banking sector.
Ans. Advantages of using Big Data Analytics in Healthcare Sector — Advantages of using big data analytics in the healthcare sector are as follows —

(i) Personalized Treatment — The response of different patients to treatment is taken into account. The move is towards formulating personalized treatment for a patient based on the genome, the response to certain medicines, allergies, and family history. When the genome is known completely, some kinds of relations between the genome and the disease are found, and treatment can be made specific to each individual. The patient gets advantages in various ways, such as a correct and effective line of treatment, better health-related decisions, prevention in time, continuous health monitoring by wireless devices, a personal line of treatment, and increased life quality and expectancy.

(ii) Reducing Readmissions — By identifying patients who get readmitted after treatment, a provider could develop preventive health plans to prevent hospital readmission. Queries that could be answered using these BDA tools include which patients are prone to frequent post-treatment readmission.

(iii) For Insurance Companies — The government does a large amount of expenditure for giving medical aid to patients. By using BDA, analysis, prediction and minimization of fraudulent medical claims can be done.

(iv) For Research — The large amount of data produced gives tools to predict epidemics by finding correlations between weather and disease; the government analyzes this massive data and accordingly preventive measures are taken. Public health surveillance also improves, as the response to disease outbreaks is quicker by using BDA.

(v) For Pharma Companies — To improve workflow quality and quantity, pharma companies need new tools like predictive modeling, statistical tools and algorithms. These improve the outcome of experiments and provide a better understanding of developing drugs. These tools successfully navigate the regulatory approval and marketing process.

(vi) In Agriculture — A biotechnology firm uses BDA techniques which are effective in optimizing crops.

Advantages of using Big Data Analytics in Banking Sector — Advantages of using big data analytics in the banking sector are as follows —

(i) Sifting Customer Information — Big data tools sift through all the information, fasten the work process and save time. The information and its proper knowledge allow organizations to identify issues before they affect their customers.

(ii) Fraud Detection and Prevention — One of the most important challenges faced by the banking sector is fraud. Big data analytics monitors how transactions are done and provides security as well as safety to the system.
(iii) Enhanced Reporting — Banks get access to huge information which also contains the different needs of different customers, and can serve those needs in a meaningful way. The banking industry provides the type of service required by the customer by using big data.

(iv) Risk Management — Big data enables early detection of risks.

(v) Targeted Marketing — By analyzing spending patterns of the customer, loyalty programs are created; targeted marketing is made possible, and relationships are built with valuable customers.

(vi) Customer Feedback — Customers' feedback is collected in text form from various social media sites and, after analysis into positive and negative sentiment, is used to provide better service.

Q.20. Explain open-source technology for big data analytics.
Ans. Open-source software is computer software whose source code is available, allowing users to use, change and improve, and at times also to distribute, the software. The open-source name came out of a 1998 meeting in Palo Alto, in reaction to Netscape's announcement of a source code release for Navigator (as Mozilla). Although the source code is released, there are still governing bodies and agreements in place. The most notable example is the GNU General Public License (GPL), under which further developments and applications are put under the same license. This ensures that the products keep improving over time for the greater population of users. Some other open-source projects are managed and supported by commercial companies, such as Cloudera, that provide extra capabilities, training and professional services supporting open-source projects such as Hadoop, similar to what Red Hat has done for the open-source project Linux.

One of the great benefits of open source lies in the flexibility of the adoption model: you download and deploy it when you need it. The open-source stack does not pull you toward someone else's predetermined ideas or vision, notes Champagne, chief technology officer at Revolution Analytics.

Hadoop has to coexist with the data warehouse, which has been the complementary solution for a long time, for many reasons. For example, moving data from Hadoop to a database requires data cleansing and data type conversion, which is the case in most circumstances. Talend provides open-source data integration tools; SQL-H is software that ...

Q.24. What are the desired properties of big data system ?
Ans. The desired properties of a big data system are as follows —

(i) Error Tolerance and Robustness — Because of the challenges of distributed systems, it is very much difficult to build a system that "does the right thing". Systems need to tolerate machines going down randomly, the complex semantics of consistency in distributed databases, and many more issues. These challenges make it complicated even to reason about what a system is doing. Robustness of a big data system is needed to overcome the complexities associated with it.

(ii) Scalability — It is the ability to maintain performance with growing data and load by adding resources to the system. The lambda architecture is horizontally scalable across all layers of the system stack, i.e., scaling is achieved by including more number of machines.

(iii) Generalization — A wide range of applications can function in a general system. As the lambda architecture is based on functions of all data, it generalizes all applications.

(iv) Debuggability — A big data system must provide the information required to debug the system when things go wrong. We should be able to trace, for each value in the system, what caused it to have that value. This is achieved through the nature of the batch layer and by preferring to use recomputation algorithms when possible.

(v) Ad hoc Queries — The ability to perform ad hoc queries on the data is significant. Every large dataset contains unanticipated value in it, and having the ability to query the data arbitrarily opens up opportunities for business optimization and new applications.

(vi) Low Latency Reads and Updates — Many applications need low-latency reads, while the update latency requirements vary more widely. In some applications updates need to propagate immediately, but in other applications an update latency of a few hours is allowed.

UNIT-II
HADOOP

INTRODUCTION TO HADOOP, CORE HADOOP COMPONENTS, HADOOP ECOSYSTEM, HIVE PHYSICAL ARCHITECTURE, HADOOP LIMITATIONS, RDBMS VERSUS HADOOP

Q.1. What is Hadoop ?
Ans. Hadoop was developed in the year 2008 by Doug Cutting and Mike Cafarella. It is Apache open-source software, written in Java, which allows storing and processing huge volumes of data in a distributed environment. Hadoop is also called MR1. Major social networking sites such as Facebook, Yahoo, Google, Twitter and LinkedIn use the Hadoop technology, which is fast and scalable.

Hadoop consists of two main frameworks, the MapReduce layer and the HDFS layer. The MapReduce layer is used for processing the big data (where the user application executes), and HDFS is used to store the big data (where the data resides).
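The division of work in the MapReduce layer can be illustrated with a toy, single-process word count. This is only a sketch of the programming model (real Hadoop jobs are written against the Java MapReduce API and run across a cluster): a map phase emits (key, value) pairs, a shuffle groups values by key, and a reduce phase combines each group.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's list of values into one result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is big", "data is stored"]          # stand-ins for HDFS blocks
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])                   # 2 2
```

On a real cluster each document in `docs` would be a block stored in HDFS, the map calls would run on the servers holding those blocks, and the framework would restart any mapper or reducer that fails or lags, exactly as described for MapReduce below.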
| 0.2. Expluin main components of Hadoop.
| Ans. Two main components of Hadoop are as follows ~
(The Hadoop Distributed File System (HDES) SHDFS is the
DES breaks it‘Computer Cluster
Hedoop 95
Hadoop’s parallel world has following two major layers —
‘Processing/computation ayer is called MapReduce
Storage layer is called adoop Distributed Fite Sytem (HDS)
(Gut. Explain the ecosystem of Hadoop.
vans. tladoop is an open source framework maintained by the Apache
on for reliable, sealable and distributed computing According tothe
hadoop apache org the eomponents of Hadoop ae defined as projects
‘imeiion different to cach other's. Some of the widely used Hadoop
(). Pig ~ tis a platform for HDFS. It consists of 2 compiles for
eduer programs and a high-level language called Pig Latin. I provides
perform daa extractions, transformations and fading andbasi nays
nt having to write MapReduce programs.
iy Hive — It's a distributed data warehouse. A data warehouse and
tinge that presents data in the form of tables. Hive
to database programming. (It was initially developed
Fig. 2.1 HDFS & Map Reduce
(i) Map Reduce ~ Because Hadoop stores the entire da
‘small pieces across a number of servers, analytical jobs can be dis
to each of the servers storing part of the data. Each ser
fragment simt
a comprehensive answe
3 Facsbook).
(iii) HBase – It is a non-relational, distributed database that runs on top of Hadoop. HBase tables can serve as input and output for MapReduce jobs.
(iv) Zookeeper – It is an application that coordinates distributed processes.
(v) Mahout – Mahout is a data mining software that can be easily integrated with Hadoop. Mahout offers Java libraries for scalable machine learning algorithms for analysing the data. These machine learning algorithms allow the user to perform tasks such as classification, clustering, association rule mining and predictive analysis.
(vi) Cassandra – Cassandra provides a database that can be easily scaled and kept highly available without interruption in job performance.
(vii) Chukwa – Chukwa is a data collection system which is mainly used to monitor the outcomes of the collected data.
(viii) Spark – Spark is a cluster computing system which is used with the Hadoop cluster for fast processing of Hadoop data. Spark does not use the MapReduce execution engine to run jobs; it uses its own distributed runtime to complete the job.
(ix) Tez – Tez is a data-flow programming framework built on Hadoop YARN to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases.

Q.3. Write short note on Hadoop's parallel world.
Ans. The Hadoop framework provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
(x) Avro – Avro is used for data serialization, which provides a file format for storing persistent data. Avro was created by Doug Cutting to make Hadoop writable in many programming languages, including JavaScript, Python and Ruby.
(xi) Ambari – It provides a web interface for managing the Hadoop components.
(xii) Sqoop – It transfers data between Hadoop and relational databases.
(xiii) Oozie – It is a Hadoop job scheduler.
The Hadoop ecosystem is shown in fig. 2.2.

Fig. 2.2 The Hadoop Ecosystem

Q.5. What are the system requirements for installing Hadoop?
Ans. The hardware and software requirements for using a single node cluster, as listed on the hadoop.apache.org website, are given below –
(i) Operating System – The Hadoop project can be run on the Linux or Windows operating system; the Linux operating system has been found to be the most efficient.
(ii) RAM Size – Sufficient RAM is needed for using Hadoop.
(iii) Processor – Two or more core processors are needed for using Hadoop.
(iv) Software – Java is required for Hadoop, and development tools such as Eclipse and IntelliJ are also available to use with this software. Hadoop is also available from commercial vendors such as Cloudera and Hortonworks, who charge for service; their free versions can be used, but then software support is not available when needed.
Once the requirements are met, the Hadoop software can be installed free of cost to get started with a simple project. Later the software and hardware can be upgraded to work on more complex projects with bigger volumes and variety of big data.

Q.6. What is Sqoop?
Ans. Sqoop is mainly used to transfer huge amounts of data between Hadoop and relational databases. Sqoop stands for "SQL to Hadoop and Hadoop to SQL". It imports data from relational databases such as MySQL, Oracle or PostgreSQL into Hadoop (HDFS, Hive, HBase), exports data from HDFS back to relational databases, and eases the migration of heterogeneous data.
Q.7. What is Zookeeper? Also write its advantages and disadvantages.
Ans. In a traditional distributed environment, coordinating a common task is quite complex and complicated, but Zookeeper overcomes this difficulty with the help of its simple architecture and its API. Zookeeper helps the nodes in the cluster to maintain the shared data and to coordinate among themselves.
Advantages of Zookeeper –
(i) It provides reliability.
(ii) It offers high synchronization and serialization.
(iii) Its atomicity eliminates the inconsistency of data among the cluster.
(iv) It is fast and simple.
Disadvantages of Zookeeper –
(i) A large number of stacks needs to be maintained.

Q.8. What is Mahout? Give its advantages and disadvantages.
Ans. Mahout is the Apache open source software framework which provides a data mining library. The processing task can be split into multiple segments and each segment can be computed on a different machine in order to speed up the computation process. The primary goals of Mahout are statistical modelling and machine learning. It is mainly used in data clustering, classification, regression testing and collaborative filtering. It provides scalable data mining approaches for the data and makes decisions based on the current and the previous history of the data.
Advantages of Mahout –
(i) It supports complementary and distributed naive Bayes classification.
(ii) Companies such as Adobe, Twitter, Foursquare, Facebook and LinkedIn internally use Mahout for data mining.
(iii) Yahoo uses it for pattern mining.
Disadvantages of Mahout –
(i) It does not support the Scala version in the development.
(ii) It has no decision tree algorithm.

Q.9. What is Oozie? Give its advantages and disadvantages.
Ans. Oozie was initially developed at Yahoo for their complex workflows in the search engine; later it was acquired by the open source Apache incubator. Oozie is a workflow scheduler for managing Hadoop jobs. There are two major types of Oozie jobs, i.e. Oozie workflow and Oozie coordinator. The Oozie workflow follows a Directed Acyclic Graph (DAG) for sequential execution of jobs in Hadoop; the control flow node controls the beginning and the end of the workflow execution. In Oozie coordination, workflow jobs are triggered by time.
Advantages of Oozie –
(i) It allows the workflow of execution to be restarted from the point of failure.
(ii) It provides a web service API, i.e. we can control the jobs from anywhere.
Disadvantages of Oozie –
(i) It is not a resource scheduler.
(ii) It is not suitable for off-grid scheduling.

Q.10. Give some applications of Hadoop.
Ans. Now-a-days, with the rapid growth of the data volume, the storage and processing of Big Data have become the most pressing needs of the enterprises. Hadoop, as the open source distributed computing platform, has become a brilliant choice for the business. Users can develop their own distributed applications on Hadoop and process Big Data even if they do not know the bottom-level details of the system. Due to the high performance of Hadoop, it has been widely used in many companies. Some applications of Hadoop are given below –
(i) Hadoop in Yahoo! – Yahoo! is the leader in Hadoop technology research and applies Hadoop on various products, which include data analysis, content optimization, anti-spam e-mail system and advertising optimization. Hadoop has also been fully used in user interests prediction, searching ranking and advertising location. In the Yahoo! home page personalization, the real-time service system mines the huge volume of data and moves the data from the database to the interest mapping. Every 5 minutes, the system will rearrange the contents based on the Hadoop cluster and update the contents every 7 minutes.
(ii) Hadoop in Facebook – Facebook is the largest social networking site in the world. The number of its users grew enormously from 2004 to 2009, and the data generated every day is huge. This data contains content sharing, comments and users' access histories, and it is complex to process, so Facebook has adopted Hadoop and Hive clusters for workloads including search, log processing, recommendation systems, data warehousing and video/image analysis.
Q.11. Why Facebook has chosen Hadoop?
Ans. As Facebook was developing, it discovered that MySQL could not meet its requirements. After long-term research and experiment, Facebook chose Hadoop and Hbase as its data processing system. The reasons why Facebook chose Hadoop and Hbase have two aspects. On the one hand, Hbase meets the requirements of Facebook. Hbase can support fast access to the data; although Hbase does not support the traditional outer join, the Hbase column-oriented storage model brings high flexibility. In its column-oriented form, Hbase is also a good choice for storing huge data; it supports complex indexes with flexible search and improves the speed of data access. On the other hand, Facebook has the confidence to solve the Hadoop problems in real use. For now, Hbase has already been able to provide high consistency and high throughput key-value storage. The Namenode, as the only manager node in HDFS, may become the bottleneck of the system, so Facebook has designed a high availability Namenode, called AvatarNode, to solve this problem. In the aspect of fault tolerance, Hadoop can tolerate and isolate faults in the subsystem of the disk, and the whole clusters of Hbase and HDFS are part of the fault tolerance system. Overall, with the improvements made by Facebook, Hadoop can meet most of Facebook's requirements and can provide a stable, efficient and safe service for the Facebook users.

Q.12. What are the advantages of Hadoop? Explain Hadoop architecture and its components with proper diagram. [R.G.P.V., May 2019 (VIII-Sem)]
Ans. Advantages of Hadoop –
(i) The scalability and elasticity of free open source Hadoop running on standard hardware allow organizations to hold onto more data and to take advantage of all their data to increase operational efficiency and gain a competitive edge. Hadoop supports complex analysis across large collections of data at one tenth the cost of traditional solutions.
(ii) Hadoop clusters are built from commodity servers, and each of these servers has local CPUs and disk storage that can be leveraged by the system; this provides the scalability needed for large scale data processing.
(iii) Hadoop can scale from a single server to thousands of machines with a high degree of fault tolerance.
Hadoop Architecture – Apache Hadoop is an open-source software framework used to store and process data on clusters of commodity computers. Data stored in a Hadoop cluster is broken down into smaller pieces and distributed throughout the cluster for processing. The Hadoop framework includes the following four modules –
(i) Hadoop Common – It contains the Java libraries and utilities that are required by the other Hadoop modules. These libraries provide filesystem and OS level abstractions, and contain the necessary Java files and scripts that are required to start Hadoop.
(ii) Hadoop Distributed File System (HDFS) – The storage layer of Hadoop.
(iii) Hadoop YARN – The framework for managing resources and scheduling applications.
(iv) Hadoop MapReduce – The processing layer of Hadoop.
For the other components of the Hadoop ecosystem, refer to Q.4.
Q.13. Explain the Hive physical architecture.
Ans. Hive works on top of Hadoop. The main components of Hive are as follows –
(i) External Interfaces – Hive provides both user interfaces, like the command line (CLI) and the web UI, and application programming interfaces, like JDBC and ODBC.
(ii) Thrift Server – The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages. The Thrift Hive clients generated in different languages are used to build common drivers like JDBC (Java) and ODBC (C++).
(iii) Driver – The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution.

Fig. 2.5 Hive Physical Architecture

Hive tables can be created as internal or external tables. On dropping an internal table, Hive deletes both the table definition and the data, whereas for an external table only the table definition is dropped.
(i) Creating an Internal Table –
CREATE TABLE STUDENTS (roll_number INT,
name STRING, age INT, address STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
(ii) Creating an External Table –
CREATE EXTERNAL TABLE STUDENTS (roll_number INT, name STRING, age INT, address STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '…';
ROW FORMAT should specify the delimiters used to terminate the fields and lines; in the above example the fields are terminated with comma (',').
Load command –
hive> LOAD DATA LOCAL INPATH '/home/hadoop/fi…' INTO TABLE students;
Select command –
hive> SELECT * FROM students;

Q.14. Give the limitations of Hadoop.
Ans. Limitations of Hadoop are as follows –
(i) Security Concerns – Hadoop is missing encryption at the storage and network levels, which is a major limitation from the point of view of government agencies and other organizations that prefer to keep their data under wraps.
(ii) Vulnerable by Nature – Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The framework is written almost entirely in Java, and Java has been heavily exploited by cyber criminals.
(iii) Not Fit for Small Data – Hadoop is designed for large volumes of data; as a result, it is not recommended for organizations with small quantities of data.

Q.15. Differentiate Hadoop vs distributed data base. [R.G.P.V., May 2019 (VIII-Sem)]
Ans. Differences between Hadoop and distributed data base (RDBMS) are as follows –

S.No.  Parameter           RDBMS                        Hadoop
1.     Type of data        Structured data with         Unstructured and
                           known schemas                structured data
2.     Data groups         Records, long fields,        Files
                           objects, XML
3.     Data modification   Updates allowed              Only appends allowed
4.     Query language      SQL & XQuery                 Simple file commands
5.     Maturity            30+ years of innovation      Relatively new
6.     Data access         Batch and interactive        Streaming access to
                           processing                   files
7.     Acceptance          Large DBA and application    Growing developer
                           development community;       community
                           widely used
HADOOP DISTRIBUTED FILE SYSTEM, PROCESSING DATA WITH HADOOP, MANAGING RESOURCES AND APPLICATION WITH HADOOP YARN, MAPREDUCE PROGRAMMING

Q.16. What is HDFS? Explain briefly.
Ans. HDFS, also known as the Hadoop Distributed File System, is the core Hadoop component which handles the storage of data. HDFS is a distributed file system with a master/slave structure: the Namenode (master) manages the file system, while the data itself is kept on the different slave nodes, as shown in fig. 2.6. If users want to add more storage to the system, they can easily increase the capacity by adding more slave nodes.
In HDFS, files are broken into blocks which are distributed across the cluster. The important aspects of HDFS include –
(i) Replica placement,
(ii) Heartbeat and block report,
(iii) High throughput access.
HDFS blocks are large compared to disk blocks, in order to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block.
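The seek-versus-transfer argument can be made concrete with a little arithmetic in plain Java. The numbers below (10 ms average seek, 100 MB/s sequential transfer) are assumed round figures for illustration, not measurements: with a 128 MB block only about 0.8% of the read time is spent seeking, while with a 4 KB disk-sized block almost all of it is.

```java
// Rough model of why HDFS uses large blocks: one seek per block,
// then a sequential transfer. Seek time and transfer rate below are
// illustrative round numbers, not benchmarks.
public class BlockMath {
    static final double SEEK_MS = 10.0;            // assumed average seek time
    static final double TRANSFER_MB_PER_S = 100.0; // assumed sequential transfer rate

    // Number of blocks needed to store a file (the last block may be partial).
    static long blocksFor(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Fraction of the total read time spent seeking, for one block of the given size in MB.
    static double seekFraction(double blockMb) {
        double transferMs = blockMb / TRANSFER_MB_PER_S * 1000;
        return SEEK_MS / (SEEK_MS + transferMs);
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024, GB = 1024L * MB;
        System.out.println(blocksFor(GB, 128 * MB));              // a 1 GB file needs 8 blocks of 128 MB
        System.out.printf("%.1f%%%n", 100 * seekFraction(128));   // seek overhead for a 128 MB block
        System.out.printf("%.1f%%%n", 100 * seekFraction(0.004)); // seek overhead for a 4 KB block
    }
}
```

Under these assumed figures the 128 MB block spends roughly 0.8% of its read time seeking, which is why a large block makes the transfer time dominate the seek time.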
Q.19. Explain the architecture of Hadoop Distributed File System (HDFS).
Ans. HDFS has the master/slave structure. The Namenode is the unique manager of HDFS: it keeps all the metadata of the file system, i.e. the information about the files and directories, and at the same time the Namenode manages the relations between each file and the location of its data blocks. The Datanode is the place where the real data is saved. The blocks are stored in the form of redundant backups in the Datanodes, and each Datanode reports its data storage list to the Namenode regularly. The Namenode does not persistently store the block locations on its hard drives; they are collected from the Datanodes when the system starts, after which the Namenode can answer the clients' requests for the required documents.
Because the Namenode is the only manager node, it can obviously become the single point of failure in the Hadoop cluster environment: if the Namenode breaks down, the whole operation stops. This is the reason why Hadoop designed the Secondary Namenode as an alternative backup. The Secondary Namenode usually runs on a separate machine and communicates with the Namenode at certain time intervals to keep a snapshot of the metadata, so that the file system can be recovered if the Namenode fails.

Q.20. Write short note on the followings –
(i) Authority management of HDFS
(ii) Limitations of HDFS.
Ans. (i) Authority Management of HDFS – HDFS shares a similar authority model with POSIX. Each file or directory has an owner and a group, and the permissions for the files or the directories are different for the owner, for the users in the same group, and for the other users. On the one hand, for the files, the -r authority is required to read and the -w authority to write; on the other hand, for the directories, users need the -r authority to list the directory content and the -w authority to create or delete files. Unlike the POSIX system, there is no sticky, setuid or setgid bit for directories, because there is no concept of executable files in HDFS.
(ii) Limitations of HDFS – HDFS is the open-source implementation of GFS (Google File System), which is an excellent distributed file system and has many advantages. HDFS was designed to run on cheap commodity hardware, not on expensive machines; this means that the probabilities of node failure are fairly high. Giving full consideration to the design of HDFS, we may find that HDFS has not only advantages but also limitations for dealing with some specific problems –
(a) High Access Latency – Because HDFS has only one single Master system, all the requests need to be processed by the Master. When there is a huge number of requests, there is inevitably delay. Currently there are some projects to address this limitation, for example the use of an upper data management layer such as Hbase to manage the data.
(b) Poor Small Files Performance – HDFS needs to use the Namenode to manage the metadata of the file system and to respond to the clients. The metadata of all files is kept in the Namenode's memory, so the number of files that can be managed is determined by the memory size of the Namenode. It is possible to manage millions of files; however, when the number of files grows beyond that, the work pressure on the Namenode becomes heavier and the time of retrieving is unacceptable.
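The owner/group/others model used by HDFS is the same rwx scheme as POSIX, and it can be made concrete with a short stand-alone Java sketch that expands an octal mode such as 0754 into its permission string. This is plain Java for illustration, not the Hadoop FsPermission API; recall that HDFS itself has no executable files, so the x bit shown here is meaningless for HDFS files.

```java
// Expand an octal permission mode (e.g. 0754) into "rwxr-xr--" form,
// the same owner/group/others model HDFS borrows from POSIX.
public class ModeString {
    static String render(int octalMode) {
        StringBuilder sb = new StringBuilder();
        // Walk the three octal digits: owner, group, others.
        for (int shift = 6; shift >= 0; shift -= 3) {
            int bits = (octalMode >> shift) & 7;
            sb.append((bits & 4) != 0 ? 'r' : '-');
            sb.append((bits & 2) != 0 ? 'w' : '-');
            sb.append((bits & 1) != 0 ? 'x' : '-');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(0754)); // prints rwxr-xr--
        System.out.println(render(0644)); // prints rw-r--r--
    }
}
```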
Q.21. Describe in detail about dataflow of file read in HDFS.
Ans. To get an idea of how data flows between the client interacting with HDFS, the Namenode and the Datanodes, consider fig. 2.8, which shows the main sequence of events when reading a file.
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1). DistributedFileSystem calls the Namenode, using RPC, to determine the locations of the first few blocks in the file (step 2). For each block, the Namenode returns the addresses of the Datanodes that have a copy of that block. Furthermore, the Datanodes are sorted according to their proximity to the client (in a MapReduce task, for example, the client may itself be a Datanode). DistributedFileSystem returns an FSDataInputStream to the client for it to read the data from; FSDataInputStream in turn wraps a DFSInputStream, which manages the Datanode and Namenode I/O.
The client then calls read() on the stream (step 3). DFSInputStream, which has stored the Datanode addresses for the first few blocks in the file, connects to the first (closest) Datanode for the first block in the file. Data is streamed from the Datanode back to the client, which calls read() repeatedly on the stream (step 4). When the end of the block is reached, DFSInputStream closes the connection to that Datanode and finds the best Datanode for the next block (step 5). This happens transparently to the client, which from its point of view is just reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to Datanodes as the client reads through the stream; it will also call the Namenode to retrieve the Datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
One important aspect of this design is that the client contacts Datanodes directly to retrieve data, and is guided by the Namenode to the best Datanode for each block. This design allows HDFS to scale to a large number of concurrent clients, since the data traffic is spread across all the Datanodes in the cluster. The Namenode meanwhile merely has to service block location requests (which it serves from memory, making them very efficient), and does not, for example, serve data, which would quickly become a bottleneck as the number of clients grew.
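The division of labour in the read path (the Namenode serves only block locations, already sorted by proximity, and the client pulls the bytes straight from the Datanodes) can be mimicked with a small stand-alone Java model. Every name below is an illustrative stand-in, not the Hadoop client API.

```java
import java.util.*;

// Toy model of the HDFS read path: a "namenode" that maps each block
// to its replica locations (already sorted by proximity to the client),
// and a client that reads each block from the first (closest) datanode.
public class ReadPathSketch {
    // blockId -> datanodes holding a replica, closest first.
    static Map<Integer, List<String>> namenode = new HashMap<>();
    // datanode -> (blockId -> block contents).
    static Map<String, Map<Integer, String>> datanodes = new HashMap<>();

    static void store(int blockId, String data, List<String> replicas) {
        namenode.put(blockId, replicas);
        for (String dn : replicas)
            datanodes.computeIfAbsent(dn, k -> new HashMap<>()).put(blockId, data);
    }

    // The client asks the namenode where each block lives (step 2), then pulls
    // the bytes from the closest datanode (step 4); the namenode never serves data.
    static String readFile(List<Integer> blockIds) {
        StringBuilder out = new StringBuilder();
        for (int id : blockIds) {
            String closest = namenode.get(id).get(0);
            out.append(datanodes.get(closest).get(id));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        store(1, "Hello, ", Arrays.asList("dn1", "dn2", "dn3"));
        store(2, "HDFS!", Arrays.asList("dn2", "dn3", "dn1"));
        System.out.println(readFile(Arrays.asList(1, 2))); // prints Hello, HDFS!
    }
}
```

Note how the two maps separate metadata traffic from data traffic: the namenode map is consulted once per block, while the bytes always come from a datanode.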
Q.22. Describe in detail about dataflow of file write in HDFS.
Ans. The case we shall consider is the case of creating a new file, writing data to it, then closing the file. The client creates the file by calling create() on DistributedFileSystem (step 1). DistributedFileSystem makes an RPC call to the Namenode to create a new file in the file system's namespace (step 2). DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles the communication with the Datanodes and the Namenode.
As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, whose responsibility it is to ask the Namenode to allocate new blocks by picking a list of suitable Datanodes to store the replicas. The list of Datanodes forms a pipeline; we will assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first Datanode in the pipeline, which stores each packet and forwards it to the second Datanode in the pipeline. Similarly, the second Datanode stores the packet and forwards it to the third (and last) Datanode in the pipeline (step 4). DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by Datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the Datanodes in the pipeline (step 5).
If a Datanode fails while data is being written to it, then the following actions are taken, which are transparent to the client writing the data. First, the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that Datanodes that are downstream from the failed node will not miss any packets. The current block on the good Datanodes is given a new identity, which is communicated to the Namenode, so that the partial block on the failed Datanode will be deleted if the failed Datanode recovers later on. The failed Datanode is removed from the pipeline and the remainder of the block's data is written to the two good Datanodes in the pipeline. The Namenode notices that the block is under-replicated and arranges for a further replica to be created on another node; subsequent blocks are then treated as normal.
When the client has finished writing data, it calls close() on the stream (step 6). This action flushes all the remaining packets to the Datanode pipeline and waits for acknowledgements before contacting the Namenode to signal that the file is complete (step 7). The Namenode already knows which blocks the file is made up of (via the DataStreamer asking for block allocations), so it only has to wait for the blocks to be minimally replicated before returning successfully.

Q. Explain InputFormat and RecordReader in MapReduce programming.
Ans. An InputFormat is responsible for creating the input splits and dividing them into records. The data to be processed is divided into splits (typically of 64/128 MB, the HDFS block size); an input split is a chunk of the input that is processed by a single map. The InputFormat class calls the getSplits() function, computes the splits for each file and then sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers. On a tasktracker, the map task passes the split to the createRecordReader() method on InputFormat to obtain a RecordReader for that split. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper. The default InputFormat is TextInputFormat, which treats each line of the input as a new value, and the associated key is the byte offset of the line.
A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs, which it passes to the map function. We can see this by looking at the Mapper's run() method –
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}
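The nextKeyValue()/getCurrentKey()/getCurrentValue() contract driven by the Mapper's run() method can be imitated in plain Java with a miniature TextInputFormat-style reader, where each record's key is the byte offset of a line and its value is the line itself. This is an illustrative stand-in, not the Hadoop LineRecordReader class.

```java
// A TextInputFormat-style record reader in miniature: each record is
// (key = byte offset of the line, value = the line itself), and the
// "map task" drives it with the same nextKeyValue() loop as Mapper.run().
public class ToyLineReader {
    private final String[] lines;
    private int index = -1;
    private long offset = -1;

    ToyLineReader(String text) { this.lines = text.split("\n", -1); }

    // Advance to the next record; returns false when the input is exhausted.
    boolean nextKeyValue() {
        if (index + 1 >= lines.length) return false;
        offset = (index < 0) ? 0 : offset + lines[index].length() + 1; // +1 for '\n'
        index++;
        return true;
    }

    long getCurrentKey() { return offset; }
    String getCurrentValue() { return lines[index]; }

    public static void main(String[] args) {
        ToyLineReader rr = new ToyLineReader("first\nsecond\nthird");
        while (rr.nextKeyValue()) { // same shape as the Mapper.run() loop
            System.out.println(rr.getCurrentKey() + "\t" + rr.getCurrentValue());
        }
    }
}
```

Running main prints each line prefixed by its byte offset (0, 6, 13 for the three lines above), exactly the (key, value) pairs TextInputFormat would hand to the map function.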
Q.23. What is the Google File System? Explain the architecture of GFS.
Ans. The Google File System (GFS) is a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by path-names, and GFS supports the usual operations such as create, delete, open, close, read and write.