Big Data Analytics-Hadoop (A7902) (VCE-R21)
UNIT-2
THE BIG DATA TECHNOLOGY LANDSCAPE
NoSQL (Not Only SQL), Types of NoSQL Databases, SQL versus NoSQL, Introduction to
Hadoop, RDBMS versus Hadoop, Distributed Computing Challenges, Hadoop Overview,
Hadoop Distributors, HDFS (Hadoop Distributed File System), Working with HDFS
commands, Interacting with Hadoop Ecosystem.
balances the load of data and queries across the available servers; and if and when a server
goes down, it is quickly replaced without any major disruption of activity.
8) Replication: It offers good support for replication, which in turn guarantees high
availability, fault tolerance, and disaster recovery.
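As a concrete illustration of NoSQL replication, here is a minimal sketch using MongoDB as
an example store; the set name rs0 and the ports, data paths, and log paths are hypothetical:
# start three mongod instances as members of a replica set named rs0
mongod --replSet rs0 --port 27017 --dbpath /data/rs0-a --fork --logpath /data/rs0-a.log
mongod --replSet rs0 --port 27018 --dbpath /data/rs0-b --fork --logpath /data/rs0-b.log
mongod --replSet rs0 --port 27019 --dbpath /data/rs0-c --fork --logpath /data/rs0-c.log
# initiate the set; the members elect a primary and replicate writes to the secondaries
mongosh --port 27017 --eval 'rs.initiate({_id:"rs0", members:[{_id:0, host:"localhost:27017"}, {_id:1, host:"localhost:27018"}, {_id:2, host:"localhost:27019"}]})'
mongosh --port 27017 --eval 'rs.status()'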
Parameter   RDBMS                                  Hadoop
Choice      When the data needs consistent         Big Data processing, which does not
            relationships                          require any consistent relationships
                                                   between data
Cost        Around $10,000 to $14,000 per          Around $4,000 per terabyte of
            terabyte of storage                    storage
Distribution                              Features
Cloudera Distribution Hadoop (CDH 6.x)    Suite of open-source/premium technologies to
                                          store, process, discover, model, serve, secure,
                                          and govern all types of data.
HPE Ezmeral Data Fabric (formerly MapR    DFS and data platform for the diverse data
Data Platform)                            needs of modern enterprise applications.
2) DataNode
There are multiple DataNodes per cluster.
During pipeline reads and writes, DataNodes communicate with each other.
A DataNode also continuously sends a "heartbeat" message to the NameNode to confirm
connectivity between the NameNode and the DataNode.
If there is no heartbeat from a DataNode, the NameNode re-replicates the blocks that were
stored on that DataNode to other DataNodes in the cluster and keeps running as if nothing
had happened.
The heartbeat report is the way DataNodes inform the NameNode that they are up and
functional and can be assigned tasks.
Analogy: a team manager can allocate tasks only to the team members who are present in
the office; the day's tasks cannot be allocated to members who have not turned up.
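The effect of heartbeats can be observed from the command line. A minimal sketch, assuming
administrator access on a running cluster: the standard dfsadmin report lists each DataNode
along with its status and the time of its last heartbeat.
hdfs dfsadmin -report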
3) Secondary NameNode
The Secondary NameNode takes a snapshot of HDFS metadata at intervals specified in the
Hadoop configuration.
Since the memory requirements of Secondary NameNode are the same as NameNode, it is
better to run NameNode and Secondary NameNode on different machines.
In case of failure of the NameNode, the Secondary NameNode can be configured manually
to bring up the cluster. However, the Secondary NameNode does not record any real-time
changes that happen to the HDFS metadata.
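The checkpoint interval is set in hdfs-site.xml. A minimal sketch, assuming Hadoop 2.x or
later (older releases used the fs.checkpoint.period property instead):
<property>
  <!-- seconds between two consecutive checkpoints of the HDFS metadata -->
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>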
Note: When you start Hadoop, it stays in safe mode for some time. You can either wait until
the time limit expires (you can watch it counting down on the NameNode web UI) or turn
safe mode off with one of the following commands:
hadoop dfsadmin -safemode leave
sudo -u hdfs hdfs dfsadmin -safemode leave
Either command turns off the safe mode of Hadoop/HDFS.
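Two related dfsadmin options may also be useful: get reports whether safe mode is
currently on, and enter puts HDFS back into safe mode for maintenance.
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode enter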
2. ls <path> Lists the contents of the directory specified by path, showing the names,
permissions, owner, size and modification date for each entry.
hdfs dfs -ls / = lists directories and files at the root of HDFS
hdfs dfs -ls /user = lists directories and files in the user directory
In Cloudera, type http://localhost:50070/ in the browser to view DFS health, and
http://localhost:50070/explorer.html#/ to view HDFS directories through the Browse
Directory panel.
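A typical sequence of HDFS commands, assuming the Cloudera quickstart user directory and
a hypothetical local file notes.txt:
hdfs dfs -mkdir -p /user/cloudera/demo        # create a directory, including parents
hdfs dfs -put notes.txt /user/cloudera/demo/  # copy a local file into HDFS
hdfs dfs -ls /user/cloudera/demo              # verify the upload
hdfs dfs -cat /user/cloudera/demo/notes.txt   # print the file contents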
3) Hive is a data warehousing layer on top of Hadoop. Hive can be used to do ad-hoc
queries, summarization, and data analysis on large data sets using an SQL-like
language called HiveQL. Anyone familiar with SQL should be able to access data stored
on a Hadoop cluster (see the HiveQL sketch after this list).
4) Pig is an easy-to-understand data flow language that helps with the analysis of large
datasets. It abstracts away some details and allows you to focus on data processing. Pig
Latin scripts are automatically converted into MapReduce jobs by the Pig interpreter to
analyze the data in a Hadoop cluster (a short Pig Latin sketch follows this list).
5) ZooKeeper: It is a coordination service for distributed applications.
6) Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.
7) Mahout: It is a scalable machine learning and data mining library.
8) Flume/Chukwa: It is a data collection system for managing large distributed systems.
9) Sqoop is a tool used to transfer bulk data between Hadoop and structured data stores
such as relational databases. With the help of Sqoop, you can import data from an
RDBMS to HDFS and vice versa (see the import/export sketch after this list).
10) Ambari: It is a web-based tool for provisioning, managing, and monitoring Apache
Hadoop clusters.
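A minimal HiveQL sketch for item 3, assuming a hypothetical comma-delimited sales table
with region and amount columns (table and column names are illustrative only):
hive -e "CREATE TABLE IF NOT EXISTS sales (region STRING, amount DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
hive -e "SELECT region, SUM(amount) AS total FROM sales GROUP BY region;"
The second query is compiled into MapReduce jobs on the cluster; no Java code is required.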
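A short Pig Latin sketch for item 4: a hypothetical word count, saved as wordcount.pig and
run with the command pig wordcount.pig (the input path is illustrative):
lines  = LOAD '/user/cloudera/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;
The Pig interpreter translates these five lines into one or more MapReduce jobs.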
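A minimal Sqoop sketch for item 9; the JDBC URL, credentials, and table names below are
hypothetical placeholders:
sqoop import --connect jdbc:mysql://localhost/salesdb \
  --username dbuser --password dbpass \
  --table customers --target-dir /user/cloudera/customers -m 1
sqoop export --connect jdbc:mysql://localhost/salesdb \
  --username dbuser --password dbpass \
  --table customers_summary --export-dir /user/cloudera/summary
The import command copies the customers table from the RDBMS into HDFS; export moves
data in the opposite direction.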
-*-*-