Big Data Unit1
Apart from the above three data types, Metadata is another important type of data in a Big Data environment.

Structured Data:
Structured data conforms to a data model or schema and is often stored in tabular form. This makes it easier for a program to sort, read and process the data. It is most often stored in a relational database.
Structured data is frequently generated by enterprise applications and information systems such as ERP and CRM systems. Examples of this type of data include banking transactions, invoices, and customer records.
The following symbol can be used to represent structured data:
Figure: Symbol to represent Structured Data.

Unstructured Data:
Data that does not conform to a data model or data schema is known as unstructured data.
• This form of data is either textual or binary. A text file may contain the contents of various tweets or blog postings, while a binary file may contain image, audio or video data.
• Special-purpose logic is usually required to process and store unstructured data.
• Unstructured data cannot be directly processed or queried using SQL. If it is required to be stored within a relational database, it is stored in a table as a Binary Large Object (BLOB).
• Alternatively, a Not-only SQL (NoSQL) database is a non-relational database that can be used to store unstructured data alongside structured data.
• Unstructured data has a faster growth rate than structured data. The following figure shows some common types of unstructured data:

Semi-structured Data:
Semi-structured data has a defined level of structure and consistency. Semi-structured data is hierarchical or graph-based.
This kind of data is commonly stored in files that contain text. For example, XML and JSON files are common forms of semi-structured data, as shown in the following figure:
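For instance, a small JSON fragment (an illustrative example, not taken from the original figure) shows the self-describing, hierarchical structure typical of semi-structured data:

{
  "user": "u101",
  "post": "Learning Hadoop",
  "tags": ["bigdata", "hadoop"],
  "replies": [ { "user": "u102", "text": "Nice!" } ]
}

Field names and nesting can vary from record to record, which is why such data does not fit a fixed relational schema.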
Figure: Characteristics of Big Data (Volume, Veracity).
Government: A very interesting use of Big Data is in the field of politics, to analyse patterns and influence election results. Cambridge Analytica Ltd. is one such organisation, driven entirely by data to change audience behaviour, and it played a major role in the electoral process.

Healthcare: Big Data is very useful in the healthcare industry. Important clinical patterns of patient disease can be studied from patients' electronic health records (EHR). This helps to improve patient care and efficiency.

Telecom: Big Data analytics can help Communication Service Providers (CSPs) improve profitability by optimizing network services/usage, enhancing customer experience, and improving security.

Map Reduce
• The Map job takes a set of data and converts it into another set of data, breaking the individual elements down into key/value pairs (tuples).
• The Reducer phase takes place after the Mapper phase has been completed. The output of a Mapper or map job (key-value pairs) is input to the Reducer. The Reducer receives the key-value pairs from multiple map jobs. Then, the Reducer aggregates those intermediate key-value pairs into a smaller set of key-value pairs as the final output.
The following figure shows the functioning of MapReduce:
Now, we will find the unique words and the number of occurrences of those unique words. The MapReduce process has the following steps:
Splitting: First, we divide the input into three splits, as shown in the following figure. This distributes the work among all the map nodes.
Mapping: Each map node tokenizes its split and emits a key/value pair of the form (word, 1) for every word.
Shuffling: The pairs are then sorted and grouped by key, so that all the tuples with the same key are sent to the corresponding reducer.
Reducing: Count the values present in the list of values for each key. For example, get the list of values [1, 1] for the key Amaravati; then count the number of ones in that list and give the final output as Amaravati, 2.
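The wordcount class used in the steps below is, in essence, the classic Hadoop WordCount job. A minimal sketch in Java is shown here for reference (based on the standard Hadoop MapReduce API; the version bundled in the examples jar may differ in detail):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapping: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // e.g. (Amaravati, 1)
      }
    }
  }

  // Reducing: sum the list of ones received for each key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // e.g. (Amaravati, 2)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The Combiner is an optional local pre-aggregation step that reduces the amount of data shuffled from the mappers to the reducers.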
As a running example, the input text placed in HDFS describes Andhra Loyola College:

Andhra Loyola College is managed and administered by the members of the Society of Jesus (Jesuits), a Catholic religious order, which has rendered signal service in the fields of education and service to humanity for over 450 years. The college was founded in December 1953 at the request of the Catholic bishops of Andhra Pradesh and began its
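The earlier steps of these notes (saving this text to a local file and copying it into HDFS) are not shown above. Assuming the text is saved locally as alc.txt (a hypothetical file name), it could be loaded into HDFS as follows:

hadoop dfs -put alc.txt /alc

This copies the local file into HDFS as /alc, the input path used in step 4.2 below.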
4.1 Change the current directory to the MAPREDUCE folder:

cd /usr/local/hadoop/share/hadoop/mapreduce

Now, we can execute the wordcount example using hadoop jar:
• mention the name of the jar: hadoop-mapreduce-examples-2.9.0.jar
• mention the class: wordcount
• give the input file which you placed in HDFS
• and also specify the name of the output file in which you want to display your result.
For example:

4.2 Executing the File

hadoop jar hadoop-mapreduce-examples-2.9.0.jar wordcount /alc /alcresult
5. Displaying the Results
Displaying the results from the output file has the following steps:

5.1 We can check the output using the following command:

hadoop dfs -ls /alcresult

It will list two files:
• a _SUCCESS file
• a PART file.
For example, as shown below:

Found 2 items
-rw-r--r--   1 hadoopusr supergroup     0 2018-12-02 15:21 /alcresult/_SUCCESS
-rw-r--r--   1 hadoopusr supergroup   582 2018-12-02 15:21 /alcresult/part-r-00000

*The output/result is available in the PART file.

5.2 Read the part file using the cat command:

hadoop dfs -cat /alcresult/part-r-00000
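The part file contains one line per unique word, with the word and its count separated by a tab. Purely for illustration (these are not the actual counts for the input above), the output has the form:

Andhra	1
College	1
the	4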
* We can also check the result in a GUI, using a browser window connected to localhost on port 50070 (the NameNode web UI).

A BRIEF HISTORY OF HADOOP
• 2004: Initial versions of what is now the Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
• December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
• January 2006: Doug Cutting joins Yahoo!.
• February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
• February 2006: Adoption of Hadoop by the Yahoo! Grid team.
• April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006: Yahoo! set up a Hadoop research cluster with 300 nodes.
• May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
• October 2006: Research cluster reaches 600 nodes.
• December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007: Research cluster reaches 900 nodes.
• April 2007: Research clusters: 2 clusters of 1,000 nodes.
• April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
• October 2008: Loading 10 terabytes of data per day onto research clusters.
• March 2009: 17 clusters with a total of 24,000 nodes.
• April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes).
YARN
• Hadoop YARN (Yet Another Resource Negotiator) is a Hadoop Ecosystem component that provides resource management.
• YARN is also one of the most important components of the Hadoop Ecosystem.
• YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads.
• It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.

Hadoop Distributed File System (HDFS)
• It is the most important component of the Hadoop Ecosystem.
• HDFS is the primary storage system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data.
• HDFS is a distributed filesystem that runs on commodity hardware.
• HDFS comes with a default configuration that suits many installations.
• We can interact directly with HDFS using shell-like commands.
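Besides shell commands, HDFS can also be accessed programmatically through its Java API. A minimal sketch (assuming a running cluster whose configuration is on the classpath; /alc is the input file used earlier in these notes):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    // Obtain the default filesystem (HDFS, per core-site.xml).
    FileSystem fs = FileSystem.get(new Configuration());
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/alc"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);   // print each line of the HDFS file
      }
    }
  }
}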
Hive
• The Hadoop Ecosystem component Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
• Hive performs three main functions: data summarization, query, and analysis.
• Hive uses a language called HiveQL (HQL), which is similar to SQL.
• Hive automatically translates HiveQL's SQL-like queries into MapReduce jobs which execute on Hadoop.
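As a brief illustration (the table and columns here are hypothetical, not from these notes), HiveQL reads much like SQL while Hive compiles the query into MapReduce jobs behind the scenes:

-- Define a hypothetical table over data stored in Hadoop.
CREATE TABLE pageviews (userid STRING, page STRING, ts STRING);

-- Count views per page; Hive translates this into MapReduce job(s).
SELECT page, COUNT(*) AS views
FROM pageviews
GROUP BY page;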
Flume
• Flume efficiently collects, aggregates and moves large amounts of data from their origin into HDFS.
• It is a fault-tolerant and reliable mechanism. This Hadoop Ecosystem component allows data to flow from the source into the Hadoop environment.
• It uses a simple extensible data model that allows for online analytic applications.

Mahout
• Mahout is an open-source framework for creating scalable machine learning algorithms and a data mining library.
• Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.

Zookeeper
• Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
• Zookeeper manages and coordinates a large cluster of machines.
Pig
• Apache Pig is a high-level platform for analyzing large datasets on Hadoop; its language, Pig Latin, is compiled into MapReduce jobs.

Oozie
• Apache Oozie is a workflow scheduler system for managing and chaining Hadoop jobs.
HBase
• Apache HBase is a Hadoop Ecosystem component which is a distributed database that was designed to store structured data in tables that could have billions of rows and millions of columns.
• HBase is a scalable, distributed NoSQL database that is built on top of HDFS.
• HBase provides real-time access to read or write data in HDFS.
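A minimal sketch of the HBase Java client API (the table name, column family and values here are hypothetical; it assumes an existing table named students with column family info):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Connect using the cluster configuration on the classpath.
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("students"))) {
      // Write one cell: row key, column family, qualifier, value.
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
          Bytes.toBytes("Asha"));
      table.put(put);
      // Read the same cell back in real time.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}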
R Connectors
• Interfaces to work with Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables.
• Predictive analytic techniques, written in R or Java as Hadoop MapReduce jobs, can be applied to data in HDFS files.
Ambari
• Ambari, another Hadoop Ecosystem component, is a management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.