
BIG DATA ANALYTICS

(A7902) (VCE-R21)

UNIT-2
THE BIG DATA TECHNOLOGY LANDSCAPE

NoSQL (Not Only SQL), Types of NoSQL Databases, SQL versus NoSQL, Introduction to
Hadoop, RDBMS versus Hadoop, Distributed Computing Challenges, Hadoop Overview,
Hadoop Distributors, HDFS (Hadoop Distributed File System), Working with HDFS
commands, Interacting with Hadoop Ecosystem.

THE BIG DATA TECHNOLOGY LANDSCAPE


 The big data technology landscape can broadly be studied under two important technologies:
1. NoSQL
2. Hadoop

2.1 NoSQL (Not Only SQL)


 The term NoSQL was first coined by Carlo Strozzi in 1998 to name his lightweight,
open-source, non-relational database that did not expose the standard SQL interface.
 The term was reintroduced by Eric Evans in early 2009.
 NoSQL databases are widely used in big data and other real-time web applications.
 NoSQL databases are used to store log data, which can then be pulled for analysis.
 Likewise, they are used to store social media data and all such data that cannot be stored
and analyzed comfortably in an RDBMS.

2.1.1 Features & Advantages of NoSQL


1) Non-relational: They do not adhere to the relational data model. They are either key-value,
document-oriented, column-oriented, or graph-based databases.
2) Open source and distributed: They are open-source, distributed databases, meaning the data is
spread across several nodes in a cluster built of low-cost commodity hardware.
3) Cheap, easy to implement: Deploying NoSQL properly allows for all of the benefits of
scaling up and down, high availability, fault tolerance, etc. while also lowering operational
costs.
4) Dealing with a rich variety: It can house large volumes of structured, semi-structured, and
unstructured data.
5) Dynamic schema: NoSQL database allows insertion of data without a pre-defined schema. In
other words, it facilitates application changes in real time, which thus supports faster
development, easy code integration, and requires less database administration.
6) Relaxes the data consistency requirement: They adhere to Brewer’s CAP (Consistency,
Availability, and Partition tolerance) theorem. They do not support ACID properties.
7) Auto-sharding: It automatically spreads data across an arbitrary number of servers. The
application in question is often not even aware of the composition of the server pool. It
balances the load of data and query on the available servers; and if and when a server goes
down, it is quickly replaced without any major activity disruptions.
8) Replication: It offers good support for replication which in turn guarantees high availability,
fault tolerance, and disaster recovery.

2.2 Types of NoSQL Databases


 NoSQL databases are non-relational.
 They can be broadly classified into the following:
1) Key-value or the big hash table
2) Document-oriented
3) Column-oriented
4) Graph-based
1) Key-value store
 It maintains a big hash table of keys and values, where each key is unique, and the value
can be a JSON, a BLOB (Binary Large Object), a string, a number, or even an entire new set of
key-value pairs encapsulated in an object.
 This kind of NoSQL database is used for collections, dictionaries, associative arrays, etc.
Key-value stores help the developer to store schema-less data. They work best for shopping
carts, user preferences, and user profiles.
An example key-value table:
   Key         Value
   Name        Bhanu
   Age         24
   Last Name   Engineer
   Height      179cm
   Weight      79Kg
 Ex: DynamoDB, Redis, Riak, etc.

2) Column oriented NOSQL databases


 Stores data in tables, rows and dynamic columns.
 Column databases store each column separately, allowing for quicker scans when only a
small number of columns are involved.
 They deliver high performance during analytics and reporting on aggregation queries like
SUM, COUNT, AVG, MIN etc. as the data is readily available in a column.
 Widely used to manage data warehouses, business intelligence, CRM, Library card
catalogs, IoT data and user profile data
 Ex: Cassandra, HBase, Hypertable, Google BigTable, etc.

3) Document-Oriented NoSQL Databases


 It stores and retrieves data as a key value pair but the value part is stored as a document.
 They are designed to store everyday documents as is, like newspaper or magazine articles.
 The document is stored in JSON or XML formats. The value is understood by the DB and
can be queried.
 They should not be used for complex transactions which require multiple operations or
queries against varying aggregate structures.
 The document type is mostly used for CMS systems, blogging platforms, real-time
analytics & e-commerce applications.
 Ex: Amazon SimpleDB, CouchDB, MongoDB, Riak, Lotus Notes, etc.
{ "Book Name": "Big Data Analytics",
  "Publisher": "Wiley India",
  "Year of Publication": "2011"
}

4) Graph-Based NoSQL Databases


 They are also called network databases.
 The entity is stored as a node, with relationships as edges. An edge gives a relationship
between nodes. Every node and edge has a unique identifier.
 Compared to a relational database, where tables are loosely connected, a graph database is
multi-relational in nature. Traversing relationships is fast, as they are already captured
in the DB, and there is no need to calculate them.
 Graph-based databases are mostly used for social networks, logistics, and spatial data.
 Ex: Neo4J, Infinite Graph, OrientDB, FlockDB

2.3 SQL Vs NoSQL


Type of database
   SQL:   Relational database
   NoSQL: Non-relational, distributed database
Model
   SQL:   Relational model
   NoSQL: Model-less approach
Schema
   SQL:   Pre-defined schema
   NoSQL: Dynamic schema for unstructured data
Database categories
   SQL:   Table-based databases
   NoSQL: Document-based, graph-based, wide-column store, or key-value pair databases
Scalability
   SQL:   Vertically scalable (by increasing system resources)
   NoSQL: Horizontally scalable (by creating a cluster of commodity machines)
Language
   SQL:   Uses SQL
   NoSQL: Uses UnQL (Unstructured Query Language)
Handling data
   SQL:   Handles data coming in low velocity and small datasets
   NoSQL: Handles data coming in high velocity and large datasets
Online processing
   SQL:   Used for OLTP
   NoSQL: Used for OLAP
Hierarchical data storage
   SQL:   Not the best fit for hierarchical data
   NoSQL: Best fit for hierarchical storage as it follows the key-value pair model
Base properties
   SQL:   Emphasis on ACID properties
   NoSQL: Follows Brewer’s CAP theorem
External support
   SQL:   Excellent support from vendors
   NoSQL: Relies heavily on community support
Complex queries
   SQL:   Good for complex queries
   NoSQL: Not a good fit for complex queries
Consistency
   SQL:   Can be configured for strong consistency
   NoSQL: Few support strong consistency (e.g., MongoDB, Cassandra)
Examples
   SQL:   Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc.
   NoSQL: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak

2.4 Introduction to Hadoop


 How Much Data Is Created Every Day in 2020?
 1.7MB of data is created every second by every person during 2020.
 In the last two years alone, an astonishing 90% of the world’s data has been created.
 2.5 quintillion bytes of data are produced by humans every day.
 463 exabytes of data will be generated each day by humans as of 2025.
 95 million photos and videos are shared every day on Instagram.
 By the end of 2020, 44 zettabytes will make up the entire digital universe.
 Every day, 306.4 billion emails are sent, and 5 million Tweets are made.
 4.4 million blog posts are posted every day.
 To process, analyze, and make sense of these different kinds of data, we need a system that
scales and addresses these challenges.
2.4.1 Why Hadoop?
 With the new paradigm, the data can be managed
with Hadoop as follows:
1) Distributes the data and duplicates chunks of each
data file across several nodes, for example, 25–30
is one chunk of data as shown in Figure.
2) Locally available compute resource is used to
process each chunk of data in parallel.
3) Hadoop Framework handles failover smartly and automatically.
2.4.2 Why not RDBMS?
 RDBMS is not suitable for storing and processing large
files, images, and videos.
 RDBMS is not a good choice when it comes to
advanced analytics involving machine learning.
 Figure describes the RDBMS system with respect to
cost and storage. It calls for huge investment as the
volume of data shows an upward trend.
2.4.3 History of Hadoop
 Hadoop is a collection of open-source software utilities that facilitates using a network
of many computers to solve problems involving massive amounts of data and
computation.
 The core of Apache Hadoop software framework consists of a storage part, known as
Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce
programming model.
 It is an open-source project of the Apache Software Foundation.
 It is a framework written in Java, originally developed by Doug Cutting and Mike
Cafarella in 2005; Cutting named it after his son’s yellow toy elephant.
 Hadoop uses Google’s MapReduce and Google File System technologies as its foundation.
 Hadoop is now a core part of the computing infrastructure for companies such as Yahoo,
Facebook, LinkedIn and Twitter, etc.
2.5 RDBMS Vs Hadoop
System
   RDBMS:  Relational Database Management System
   Hadoop: Node-based flat structure
Data
   RDBMS:  Suitable for structured data
   Hadoop: Suitable for structured and unstructured data; supports a variety of data formats in
           real time such as XML, JSON, and text-based flat file formats
Processing
   RDBMS:  OLTP
   Hadoop: Analytical, Big Data processing
Choice
   RDBMS:  When the data needs consistent relationships
   Hadoop: Big Data processing, which does not require any consistent relationships between data
Processor
   RDBMS:  Needs expensive hardware or high-end processors to store huge volumes of data
   Hadoop: In a Hadoop cluster, a node requires only a processor, a network card, and a few
           hard drives
Cost
   RDBMS:  Around $10,000 to $14,000 per terabyte of storage
   Hadoop: Around $4,000 per terabyte of storage

2.6 Distributed Computing Challenges


 There are two major challenges with distributed computing:
1) Hardware Failure: In a distributed system, several servers are networked together, and there
is always a possibility of hardware failure. When such a failure does happen, how does one
retrieve the data that was stored in the system?
 Hadoop answers this problem with the Replication Factor (RF), which is the number of
copies of a given data item/data block stored across the network (see the commands after
this list).
2) How to Process This Gigantic Store of Data? In a distributed system, the data is spread
across the network on several machines. A key challenge here is to integrate the data
available on several machines prior to processing it.
 Hadoop solves this problem by using MapReduce Programming. It is a programming
model to process the data.
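
To make the Replication Factor concrete, the HDFS shell (covered in Section 2.10) can change and
inspect it for an existing file; the path below is illustrative:
 hdfs dfs -setrep -w 2 /user/ABP1/abpfile1 = change the file's replication factor to 2 and
wait until re-replication completes
 hdfs dfs -ls /user/ABP1/ = the second column of the listing shows each file's replication factor
The cluster-wide default comes from the dfs.replication property (see the hdfs-site.xml sketch in
Section 2.9.1).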

2.7 Hadoop Overview


 Hadoop is an open-source software framework to store and process massive amounts of
data in a distributed fashion on large clusters of commodity hardware.
 Basically, Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.
2.7.1 Hadoop Features / Advantages / Key Considerations / Aspects
1) Open-source (License Free): Hadoop is an open-source framework whose code can be
modified and changed as per business requirements. It is License Free to Download,
Install and work with.
2) Meant for Big data Analytics: Hadoop can handle high Volume, Variety, Velocity & Value
Big Data.
3) Distributed storage and Parallel processing: Data is stored in a distributed manner in
HDFS across the cluster (multiple computers) in its native format, and data is processed
by MapReduce programming using Massive Parallel Processing (MPP) technique on a
cluster of nodes, such that entire programs run in less time.
4) Shared Nothing Architecture: It is a cluster of independent machines, where every node
performs its job using its own resources. HDFS imposes no structure and is largely
schema-less.
5) Horizontally Scalable: Owing to its scale-out architecture, as data keeps on growing, we
keep adding nodes. It can integrate seamlessly with cloud-based services.
6) Data Locality: Hadoop works on data locality principle which states that move
computation to data instead of data to computation.
7) Cost-effective: Hadoop has a much-reduced cost per terabyte for storing and processing data,
as it uses commodity hardware (cheap machines that do not require a very high-end server).
Hadoop 3.0 has only 50% storage overhead, as opposed to 200% in Hadoop 2.x.
8) Replication: It replicates (mirror copy) its data across multiple nodes in the cluster by 3X
Replication Factor. If one node crashes, the data can still be processed from another node
that stores its replica, thus, data is reliably stored, highly available and accessible.

9) Fault-tolerant / Resilient to failure: Due to the Replication Factor, Hadoop MapReduce can
quickly recognize faults that occur and then apply a quick and automatic recovery solution.
10) Flexible for varied data sources: It works with all kinds of data (structured,
semi-structured, and unstructured) to derive meaningful business insights from email
conversations, social media data, log analysis, data mining, recommendation systems,
market analysis, etc.
11) Fast: HDFS implements a mapping system to store and locate data in a cluster.
MapReduce programming is also located in the very same servers and thus allows faster
data processing.

2.7.2 Versions of Hadoop


1) Hadoop 1.0
2) Hadoop 2.0: HDFS continues to be the data storage framework; YARN is added for cluster
resource management.
3) Hadoop 3.0: introduces erasure coding, which is what brings the storage overhead down to
the 50% noted above.

2.7.3 Hadoop Components


 The core aspects (components) of Hadoop include the following:
1) Hadoop Distributed File System (HDFS): HDFS component takes care of the storage
part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS
creates multiple replicas of data blocks and distributes data across compute nodes in a
cluster. This distribution enables reliable and extremely rapid computations.
2) MapReduce: MapReduce is a computational and software framework for writing applications
which run on Hadoop. These MapReduce programs split a task across multiple nodes and
process enormous amounts of data in parallel.
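
To make the split-and-aggregate idea concrete, below is a minimal word-count sketch written
against the classic Hadoop MapReduce Java API. It is an illustration, not part of the prescribed
material; the class names and the input/output paths passed on the command line are assumptions.

    // A minimal word-count sketch using the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce).
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every word in an input line.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums the counts emitted for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);   // combiner is optional
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, it would typically be submitted with something like
hadoop jar wordcount.jar WordCount <input path> <output path>, where the output directory must
not already exist in HDFS.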

2.7.4 High-Level Architecture of Hadoop


 Hadoop is a distributed Master-Slave Architecture. A Hadoop cluster consists of a single
master and multiple slave nodes.
 Master node is known as NameNode and Slave nodes are known as DataNodes.
 Figure 3.4.5 depicts the Master–Slave Architecture of
Hadoop Framework.
 Key components of the Master Node:
1) Master HDFS: Its main responsibility is partitioning the
data storage across the slave nodes. It also keeps track
of locations of data on DataNodes.
2) Master MapReduce: It decides and schedules computation tasks on the slave nodes.

2.8 Hadoop Distributions and Vendors


 At its core, Hadoop is an Open-Source system which means anyone can freely download
and use the core aspects (components) of Hadoop.
 Hadoop distributions are used to provide scalable, distributed computing against on-
premises and cloud-based file store data.
 Distributions are composed of commercially packaged and supported editions of open-
source Apache Hadoop-related projects.
 Distributions provide access to applications, query/reporting tools, machine learning and
data management infrastructure components.

 Cloudera Distribution (CDH 6.x): Hadoop suite of open-source/premium technologies to store,
   process, discover, model, serve, secure and govern all types of data.
 Hortonworks Data Platform (HDP 3.x): open-source framework for distributed storage and
   processing of large, multi-source data sets (Horton - elephant).
 HPE Ezmeral Data Fabric (formerly MapR Data Platform): DFS and data platform for the diverse
   data needs of modern enterprise applications.
 Oracle Big Data Appliance (version X8-2L): open-source, flexible, high-performance, secure
   platform for running diverse workloads on Hadoop, Kafka and Spark.
 IBM InfoSphere (version 8.1): powerful, scalable ETL platform for all data types across
   on-premises and cloud environments.
 Azure HDInsight (version 4.0): cloud distribution of Hadoop components which is easy, fast,
   and cost-effective for processing massive amounts of data.
 Elastic MapReduce (Amazon EMR 6.x): web service (cloud distribution) to manage huge Big Data
   datasets, web indexing, financial analysis, scientific simulation, and bioinformatics.
 BigQuery (Google Cloud Platform): cloud-based big data analytics web service for processing
   very large read-only data sets.
 Pivotal Big Data Suite (Pivotal HD with HAWQ and Greenplum DB): premium analytical
   SQL-on-Hadoop engine (HAWQ: HAdoop With Queries).

2.9 Hadoop Distributed File System


Key aspects / functionalities / points
 Some key points of the Hadoop Distributed File System are as follows:
1) Storage component of Hadoop.
2) Distributed File System.
3) Modeled after Google File System.
4) Optimized for high throughput (HDFS leverages large block size and moves computation
where data is stored).
5) A file is replicated a configured number of times, which makes HDFS tolerant of both
software and hardware failure.
6) Re-replicates data blocks automatically when nodes fail.
7) You can realize the power of HDFS when you perform read or write on large files
(gigabytes and larger).
8) Sits on top of a native file system such as ext3 or ext4.

2.9.1 HDFS Architecture


 Client Application interacts with NameNode for metadata related activities and
communicates with DataNodes to read and write files. DataNodes converse with each
other for pipeline reads and writes.

Fig 3.8.2: Hadoop Distributed File System Architecture


 Let us assume that the file “Sample.txt” is of size 384 MB. As per the default data block
size (128 MB), it will be split into three blocks and replicated across the nodes on the
cluster based on the default replication factor.
 Hadoop 1.x uses a default block size of 64 MB, while Hadoop 2.x and Hadoop 3.x clusters
default to 128 MB and can be configured to 64 MB, 256 MB, or 512 MB. The Hadoop
administrator controls the block size configured for the cluster.
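
For reference, the block size and the default replication factor are normally configured in
hdfs-site.xml. The minimal sketch below uses illustrative values; vendor distributions usually
manage these properties through their own administration tools.

    <!-- hdfs-site.xml (illustrative values only) -->
    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <!-- 128 MB expressed in bytes; a suffixed form such as 128m is also accepted -->
        <value>134217728</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <!-- default number of replicas per block -->
        <value>3</value>
      </property>
    </configuration>

Files already written keep the replication factor they were created with; it can be changed
later with hdfs dfs -setrep, as shown in Section 2.6.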


HDFS Architecture Daemons


1) NameNode
 HDFS breaks a large file into smaller pieces called blocks.
 NameNode uses a rack ID to identify DataNodes in the rack. A rack is a collection of
DataNodes within the cluster.
 NameNode keeps track of the blocks of a file as they are placed on various DataNodes.
 NameNode manages file-related operations such as read, write, create, and delete. Its
main job is managing the File System Namespace.
 A file system namespace is the collection of files in the cluster. NameNode stores the
HDFS namespace. The file system namespace includes the mapping of blocks to files and
file properties, and is stored in a file called FsImage.
 NameNode uses an EditLog (transaction log) to record every transaction that happens to
the filesystem metadata.
 When NameNode starts up, it reads FsImage and EditLog from disk and applies all
transactions from the EditLog to in-memory representation of the FsImage. Then it flushes
out new version of FsImage on disk and truncates the old EditLog because the changes are
updated in the FsImage. There is a single NameNode per cluster.
 Hadoop stores data in the Hadoop Distributed File System (HDFS).
 A huge data file is divided into multiple blocks and each block
is stored over multiple nodes on the cluster.
 By default, each block is 128 MB in size.
 In the diagram below, a 542 MB file is divided into five blocks: four blocks of 128 MB
each and one block of 30 MB.
 The block size can be scaled up and down as needed.

2) DataNode
 There are multiple DataNodes per cluster.
 During Pipeline read and write DataNodes communicate with each other.
 A DataNode also continuously sends “heartbeat” message to NameNode to ensure the
connectivity between the NameNode and DataNode.
 In case there is no heartbeat from a DataNode, the NameNode re-replicates the blocks that
were stored on that DataNode to other DataNodes in the cluster and keeps running as if
nothing had happened.
 The heartbeat report is the way DataNodes inform the NameNode that they are up and
functional and can be assigned tasks (the admin command after this list shows which
DataNodes are currently live).
 This is analogous to a team manager who can allocate tasks only to the team members
present in the office; the tasks for the day cannot be allocated to team members who have
not turned up.
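
On a running cluster, the liveness information gathered from these heartbeats can be inspected
with the HDFS admin report command (run as an HDFS superuser, for example the hdfs user on
Cloudera):
 sudo -u hdfs hdfs dfsadmin -report = lists live and dead DataNodes with their capacity and usage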

3) Secondary NameNode
 The Secondary NameNode takes a snapshot of HDFS metadata at intervals specified in the
Hadoop configuration.
 Since the memory requirements of Secondary NameNode are the same as NameNode, it is
better to run NameNode and Secondary NameNode on different machines.
 In case of failure of the NameNode, the Secondary NameNode can be configured manually
to bring up the cluster. However, the Secondary NameNode does not record any real-time
changes that happen to the HDFS metadata.

Figure 3.8.2 NameNode and DataNode Communication.

2.9.2 Anatomy of File Read


The steps involved in the File Read are as follows:
1) The client opens the file that it wishes to read from by calling open() on the
DistributedFileSystem.
2) DistributedFileSystem communicates with the NameNode to get the location of data
blocks. NameNode returns with the addresses of the DataNodes that the data blocks are
stored on. Subsequent to this, the DistributedFileSystem returns an FSDataInputStream to
client to read from the file.
3) The client then calls read() on the stream. DFSInputStream, which holds the addresses of
the DataNodes for the first few blocks of the file, connects to the closest DataNode for
the first block in the file.
4) Client calls read() repeatedly to stream the data from the DataNode.
5) When end of the block is reached, DFSInputStream closes the connection with the
DataNode. It repeats the steps to find the best DataNode for the next block and
subsequent blocks.
6) When the client completes the reading of the file, it calls close() on the FSDataInputStream
to close the connection.

Figure 3.8.3a Anatomy of File Read.
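
The same sequence can be expressed through the Hadoop Java FileSystem API. The sketch below is
illustrative only: the NameNode URI hdfs://localhost:8020 and the file path are assumptions and
must match your cluster's fs.defaultFS setting.

    // Minimal sketch of the file-read path: open() -> read() -> close().
    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), conf); // assumed NameNode URI

            InputStream in = null;
            try {
                // Steps 1-2: open() asks the NameNode for block locations and
                // returns an FSDataInputStream (used here as an InputStream).
                in = fs.open(new Path("/user/ABP1/abpfile1"));
                // Steps 3-5: read the blocks from the closest DataNodes and copy them to stdout.
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                // Step 6: close() releases the connection to the DataNodes.
                IOUtils.closeStream(in);
            }
        }
    }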


2.9.3 Anatomy of File Write


The steps involved in anatomy of File Write are as follows:
1) The client calls create() on DistributedFileSystem to create a file.
2) An RPC call to the NameNode happens through the DistributedFileSystem to create a new
file. The NameNode performs various checks to create a new file (checks whether such a
file exists or not). Initially, the NameNode creates a file without associating any data
blocks to the file. The DistributedFileSystem returns an FSDataOutputStream to the client
to perform write.
3) As the client writes data, data is split into packets by DFSOutputStream, which is then
written to an internal queue, called data queue. DataStreamer consumes the data queue.
The DataStreamer requests the NameNode to allocate new blocks by selecting a list of
suitable DataNodes to store replicas. This list of DataNodes makes a pipeline. Here, we will
go with the default replication factor of three, so there will be three nodes in the pipeline
for the first block.
4) DataStreamer streams the packets to the first DataNode in the pipeline. It stores packet
and forwards it to the second DataNode in the pipeline. In the same way, the second
DataNode stores the packet and forwards it to the third DataNode in the pipeline.
5) In addition to the internal queue, DFSOutputStream also manages an “Ack queue” of
packets that are waiting for the acknowledgement by DataNodes. A packet is removed
from the “Ack queue” only if it is acknowledged by all the DataNodes in the pipeline.
6) When the client finishes writing the file, it calls close() on the stream.
7) This flushes all the remaining packets to the DataNode pipeline and waits for relevant
acknowledgments before communicating with the NameNode to inform the client that the
creation of the file is complete.

Figure 3.8.3b Anatomy of File Write
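
Correspondingly, here is a hedged sketch of the write path using the Java FileSystem API; the
NameNode URI and the output path are assumptions.

    // Minimal sketch of the file-write path: create() -> write() -> close().
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), conf); // assumed NameNode URI

            // Steps 1-2: create() registers the new file with the NameNode and
            // returns an FSDataOutputStream for the client to write to.
            Path file = new Path("/user/ABP1/hello.txt");
            try (FSDataOutputStream out = fs.create(file)) {
                // Steps 3-5: data is split into packets and pushed down the DataNode pipeline.
                out.writeBytes("Hello, HDFS!\n");
            }
            // Steps 6-7: closing the stream flushes the remaining packets, waits for
            // acknowledgements and tells the NameNode the file is complete.
        }
    }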


2.10 Working with HDFS Commands


 There are many more commands in "$HADOOP_HOME/bin/hadoop fs" than are
demonstrated here, although these basic operations will get you started.
 Running ./bin/hadoop dfs with no additional arguments will list all the commands that
can be run with the FsShell system.
 $HADOOP_HOME/bin/hadoop fs -help commandName will display a short usage
summary for the operation.
 The following conventions are used for parameters:
• "<path>" means any file or directory name.
• "<path>..." means one or more file or directory names.
• "<file>" means any filename.
• "<src>" and "<dest>" are path names in a directed operation.
• "<localSrc>" and "<localDest>" are paths as above, but on the local file system .
 The 'hdfs dfs' command is used specifically for Hadoop file system (HDFS) data operations,
while 'hadoop fs' covers a larger variety of data present on external platforms as well,
including the local file system.
1. version
Displays the Hadoop framework version and the respective Hadoop distribution version.
 hadoop version
 hdfs version

Note: When you start Hadoop, it stays in safe mode for some time. You can either wait until that
time limit expires (you can see it counting down on the NameNode web UI) or turn safe mode off
with
 hadoop dfsadmin -safemode leave
 sudo -u hdfs hdfs dfsadmin -safemode leave
Either command turns off the safe mode of Hadoop/HDFS.

2. ls <path>
Lists the contents of the directory specified by path, showing the name, permissions, owner,
size and modification date for each entry.
 hdfs dfs -ls /     = lists directories and files at the root of HDFS
 hdfs dfs -ls /user = lists directories and files in the user directory
In Cloudera, open http://localhost:50070/ in a browser to view DFS health, and
http://localhost:50070/explorer.html#/ to browse HDFS directories through the Browse Directory panel.


Create files and directories

 Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data
into HDFS first.
 Let's create a directory and put a file in it. HDFS has a default working directory of
/user/$USER, where $USER is your login user name.
 This isn't automatically created for you, though, so let's create it with the mkdir
command.
 For the purpose of illustration, we use chuck. You should substitute your user name in
the example commands.
1. -mkdir <path/FolderName>
Creates a named directory in the given path of HDFS. In HDFS there is no home directory by
default.
 hdfs dfs -mkdir /ABP
 hdfs dfs -mkdir /ABP/abpsubdir = creates a sub-directory
 hdfs dfs -mkdir /user/ABP1     = creates a sub-directory under /user
2. Create a new file with content in the local file system (files present on the OS).
gedit abpfile1
Create a file with the name "abpfile1" and type some content in it.
Similarly create the abpget1 file.
To view files of the local file system, go to the Cloudera desktop and double-click Cloudera's
home icon. Its path for programming is /home/cloudera/
A) Adding files and directories to HDFS
3. -put <localSrc> <hdfsDest>
Copies a file/folder from the local file system to the HDFS store. The file then exists in both
locations.
 hdfs dfs -put /home/cloudera/abpfile1 /user/ABP1/ = file copied; exists in both locations
 hdfs dfs -ls -R /user/ABP1/                       = check the destination location for the file

4. -copyFromLocal <localSrc> <hdfsDest> (identical to -put)
Copies a file/folder from the local file system to the HDFS store. The file exists in both
locations.
 hdfs dfs -copyFromLocal /home/cloudera/abpfile1 /user/ABP1/
5. -moveFromLocal <localSrc> <hdfsDest>
Moves a file/folder from the local file system to the HDFS store. Works like -put, but deletes
the moved file/folder from the local file system, so it exists only in the HDFS store.
 hdfs dfs -moveFromLocal /home/cloudera/abpget1 /user/ABP1/

B) Retrieving files from HDFS to the local file system
6. -get <hdfsSrc> <localDest>
Copies a file/folder from HDFS to the local file system. The file exists in both locations.
 hdfs dfs -get /user/ABP1/abpfile1 /home/cloudera/abpfile2


7. -copyToLocal <hdfsSrc> <localDest> (identical to -get)
Copies a file/folder from HDFS to the local file system. The file exists in both locations.
 hdfs dfs -copyToLocal /user/ABP1/abpget1 /home/cloudera/
8. -moveToLocal <hdfsSrc> <localDest>
Moves a file/folder from the HDFS store to the local file system. Works like -get, but deletes
the moved file/folder from the HDFS store, so it exists only in the local file system.
 hdfs dfs -moveToLocal /user/ABP1/abpget1 /home/cloudera/
9. -cp <hdfsSrc> <hdfsDest>
Copies the file or directory from the given source to the destination within HDFS.
 hdfs dfs -cp /A123 /user/
10. -mv <hdfsSrc> <hdfsDest>
Moves the file from the specified source to the destination within HDFS.
 hdfs dfs -mv /user/ABP1/abpfile1 /A123
11. -cat <filename>
Displays the contents of an HDFS file on the console.
 hdfs dfs -cat /user/ABP1/abpget1
C) Deleting files from HDFS
12. -rm -r <hdfsFilename/hdfsDirectoryName>
Deletes a file or directory from HDFS; with -r it deletes recursively.
 hdfs dfs -rm /user/A123/abpfile1 = removes the file
 hdfs dfs -rm -r /user/ABP1       = recursively removes the directory and all its contents
13. -du <path>
Shows disk usage, in bytes, for all the files which match path.
 hdfs dfs -du /user/A123/abpfile1

2.11 Interacting with Hadoop Ecosystem


 Hadoop is a framework that enables processing of large data sets which reside on clusters
of commodity machines.
 Being a framework, Hadoop is made up of several modules that are supported by a large
ecosystem of technologies.
 The Hadoop Ecosystem is a platform or suite which provides various services to solve big
data problems. It includes Apache projects and various commercial tools and solutions.
 The following are the components of the Hadoop ecosystem:
1) Hadoop Distributed File System (HDFS): It simply stores data files as close to the
original form as possible.
2) HBase is a column-oriented NoSQL database for Hadoop and compares well with an
RDBMS. It supports structured data storage for billions of rows and millions of
columns. HBase provides random read/write operation. It also supports record level
updates which is not possible using HDFS. HBase sits on top of HDFS.

3) Hive is a Data Warehousing Layer on top of Hadoop. Hive can be used to do ad-hoc
queries, summarization, and data analysis on large data sets using an SQL-like
language called HiveQL (see the example after this list). Anyone familiar with SQL
should be able to access data stored on a Hadoop cluster.
4) Pig is an easy-to-understand data flow language that helps with the analysis of large
datasets. It abstracts some details and allows you to focus on data processing. Pig Latin
scripts are automatically converted into MapReduce jobs by the Pig interpreter to
analyze the data in a Hadoop cluster.
5) ZooKeeper is a coordination service for distributed applications.
6) Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.
7) Mahout: It is a scalable machine learning and data mining library.
8) Flume/Chukwa: It is a data collection system for managing large distributed systems.
9) Sqoop is a tool used to transfer bulk data between Hadoop and structured data stores
such as relational databases. With the help of Sqoop, you can import data from RDBMS
to HDFS and vice-versa.
10) Ambari: It is a web-based tool for provisioning, managing, and monitoring Apache
Hadoop clusters.
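
As a small taste of how two of these tools are used (see the pointer in the Hive item above),
the snippets below are illustrative only: the table emp, its columns, the MySQL database payroll
and the credentials are invented for this example.

    -- HiveQL: create a table, load a file already sitting in HDFS, and run an aggregate query
    CREATE TABLE emp (id INT, name STRING, dept STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    LOAD DATA INPATH '/user/ABP1/emp.csv' INTO TABLE emp;
    SELECT dept, COUNT(*) AS headcount FROM emp GROUP BY dept;

A comparable Sqoop import, pulling the same table from MySQL into HDFS, would look roughly like:
 sqoop import --connect jdbc:mysql://localhost/payroll --username hduser -P --table emp
--target-dir /user/ABP1/emp_import --num-mappers 1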

-*-*-

Andraju Bhanu Prasad, Associate Professor, CSE