15CS82 Module 2
Essential Hadoop Tools
▪ To begin, copy the passwd file to a working directory for local Pig operation:
$ cp /etc/passwd .
▪ Next, copy the data file into HDFS for the Hadoop MapReduce operation:
$ hdfs dfs -put passwd passwd
▪ To confirm the file is in HDFS, enter the following command:
$ hdfs dfs -ls passwd
-rw-r--r-- 2 hdfs hdfs 2526 2015-03-17 11:08 passwd
▪ In local Pig operation, all processing is done on the local machine (Hadoop is not used). First, start the interactive command line:
$ pig -x local
▪ If Pig starts correctly, you will see a grunt> prompt and possibly a number of INFO messages. Next, enter the following commands to load the passwd file, grab the user name, and dump it to the terminal.
▪ Pig commands must end with a semicolon (;).
▪ grunt> A = load 'passwd' using PigStorage(':');
▪ grunt> B = foreach A generate $0 as id;
▪ grunt> dump B;
▪ The processing will start and a list of user names will be printed to the screen.
▪ To exit the interactive session, enter the command quit.
grunt> quit
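▪ The same Pig Latin commands can also be placed in a script file and run non-interactively. The sketch below makes one assumption: the script name id.pig is illustrative and not part of the original example.
-- id.pig: extract the user name field from passwd
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
▪ The script can be run locally with $ pig -x local id.pig, or against the copy of passwd already placed in HDFS with $ pig -x mapreduce id.pig.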
Apache Sqoop
▪ Sqoop is a tool designed to transfer data between Hadoop and relational databases.
▪ Sqoop is used to:
- import data from a relational database management system (RDBMS) into the Hadoop Distributed File System (HDFS),
- transform the data in Hadoop, and
- export the data back into an RDBMS.
▪ Sqoop imports data in two steps:
1) Sqoop examines the database to gather the necessary metadata for the data to be imported.
2) A map-only Hadoop job transfers the actual data using the metadata.
▪ The imported data are saved in an HDFS directory. By default, Sqoop uses the name of the imported table for the directory, or the user can specify any alternative directory where the files should be populated. By default, these files contain comma-delimited fields, with new lines separating different records.
▪ The Sqoop example proceeds in the following steps:
1. Download Sqoop.
2. Load the sample MySQL database.
3. Add Sqoop user permissions for the local machine and cluster.
4. Import data from MySQL to HDFS.
5. Export data from HDFS to MySQL.
Step 1: Download Sqoop and Load Sample MySQL Database
To install Sqoop:
# yum install sqoop sqoop-metastore
To download the sample World database:
$ wget http://downloads.mysql.com/docs/world_innodb.sql.gz
Step 2: Add Sqoop User Permissions for the Local Machine and Cluster
In MySQL, add the following privileges for user sqoop:
mysql> GRANT ALL PRIVILEGES ON world.* TO 'sqoop'@'limulus' IDENTIFIED BY 'sqoop';
mysql> GRANT ALL PRIVILEGES ON world.* TO 'sqoop'@'10.0.0.%' IDENTIFIED BY 'sqoop';
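▪ With the permissions in place, a table can be imported into HDFS (steps 4 and 5 above). The following is a minimal sketch: the host name limulus comes from the GRANT statements, while the Country table (part of the sample World database) and the target directory are chosen here only for illustration.
$ sqoop import --connect jdbc:mysql://limulus/world \
  --username sqoop --password sqoop \
  --table Country -m 1 --target-dir /user/sqoop/country
▪ The reverse direction (HDFS to MySQL) uses sqoop export with an --export-dir option pointing at the HDFS directory and a --table option naming an existing MySQL table.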
Apache Flume
▪ Data transport involves a number of Flume agents that may traverse a series of machines and locations.
▪ Flume is often used for log files, social media-generated data, email messages, and just
about any continuous data source.
▪ A Flume agent is built from three components: a source, a channel, and a sink. A Flume agent must have all three of these components defined, and an agent can have several sources, channels, and sinks.
▪ A source can write to multiple channels, but a sink can take data from only a single channel.
▪ Data written to a channel remain in the channel until a sink removes the data.
▪ By default, the data in a channel are kept in memory but may be optionally stored on
disk to prevent data loss in the event of a network failure.
▪ Flume agents may be placed in a pipeline, possibly to traverse several machines or domains.
▪ In a Flume pipeline, the sink from one agent is connected to the source of another.
▪ The data transfer format normally used by Flume is called Apache Avro.
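▪ A minimal agent configuration illustrates the source-channel-sink wiring described above. Everything in this sketch is illustrative: the agent name a1, the netcat source listening on port 44444, the in-memory channel, and the console logger sink are assumptions, not part of the original example.
# example.conf: a single-agent Flume configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source: read text lines arriving on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# channel: buffer events in memory
a1.channels.c1.type = memory
# sink: write events to the console log
a1.sinks.k1.type = logger
# a source may feed several channels, but each sink reads from exactly one channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
▪ The agent would then be started with something like $ flume-ng agent -n a1 -c conf -f example.conf.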
Apache Hive
▪ To start Hive, simply enter the hive command. If Hive starts correctly, you get a hive> prompt.
$ hive
(some messages may show up here)
hive>
▪ The following Hive commands create and then drop a table. Note that Hive commands must end with a semicolon (;).
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 1.705 seconds
▪ To see that the table has been created:
hive> SHOW TABLES;
OK
pokes
Time taken: 0.174 seconds, Fetched: 1 row(s)
▪ To drop the table,
hive> DROP TABLE pokes;
OK
Time taken: 4.038 seconds
▪ A more detailed example creates a table to hold data from a web server log file:
hive> CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
▪ Next, load the data from the sample.log file. Note that the file is found in the local directory and not in HDFS.
hive> LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO
TABLE logs;
▪ Finally, apply the select step. Note that this invokes a Hadoop MapReduce operation. The results appear at the end of the output.
hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs GROUP BY t4;
Distributed-Shell
▪ Distributed-Shell is an example application included with the Hadoop core components that
demonstrates how to write applications on top of YARN.
▪ It provides a simple method for running shell commands and scripts in containers in parallel
on a Hadoop YARN cluster.
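▪ For example, a shell command such as uptime can be run in two containers with an invocation along the following lines; the exact jar path varies by installation and is only an assumption here.
$ yarn org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -shell_command uptime -num_containers 2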
Hadoop MapReduce
▪ MapReduce was the first YARN framework and drove many of YARN’s requirements. It
is integrated tightly with the rest of the Hadoop ecosystem projects, such as Apache Pig,
Apache Hive, and Apache Oozie.
Apache Tez:
▪ Many Hadoop jobs involve the execution of a complex directed acyclic graph (DAG) of tasks using separate MapReduce stages. Apache Tez generalizes this process and enables these tasks to be spread across stages so that they can be run as a single, all-encompassing job.
▪ Tez can be used as a MapReduce replacement for projects such as Apache Hive and Apache
Pig. No changes are needed to the Hive or Pig applications.
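▪ For example, on a Hive installation where Tez support has been installed and configured (an assumption here), the earlier log-table query can be switched to the Tez engine with a single session setting:
hive> SET hive.execution.engine=tez;
hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs GROUP BY t4;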
Apache Giraph
▪ Apache Giraph is an iterative graph processing system built for high scalability.
▪ In addition, using the flexibility of YARN, the Giraph developers plan on implementing
their own web interface to monitor job progress.
Hoya: HBase on YARN
▪ A client application creates the persistent configuration files, sets up the HBase cluster XML files, and then asks YARN to create an ApplicationMaster.
▪ YARN copies all files listed in the client’s application-launch request from HDFS into the
local file system of the chosen server, and then executes the command to start the Hoya
ApplicationMaster.
▪ Hoya also asks YARN for the number of containers matching the number of HBase region
servers it needs.
Dryad on YARN
▪ Similar to Apache Tez, Microsoft’s Dryad provides a DAG as the abstraction of execution
flow. This framework is ported to run natively on YARN and is fully compatible with its
non-YARN version.
▪ The code is written completely in native C++ and C# for worker nodes and uses a thin layer
of Java within the application.
Apache Spark
▪ Spark was initially developed for applications in which keeping data in memory improves
performance, such as iterative algorithms, which are common in machine learning, and
interactive data mining.
▪ Spark differs from classic MapReduce in two important ways.
▪ First, Spark holds intermediate results in memory, rather than writing them to disk.
▪ Second, Spark supports more than just MapReduce functions; that is, it greatly expands
the set of possible analyses that can be executed over HDFS data stores.
Apache Storm
▪ This framework is designed to process unbounded streams of data in real time. It can be
used in any programming language.
▪ The basic Storm use-cases include real-time analytics, online machine learning, continuous
computation, distributed RPC (remote procedure calls), ETL (extract, transform, and load),
and more.
▪ Storm provides fast performance, is scalable, is fault tolerant, and provides processing
guarantees.
▪ It works directly under YARN and takes advantage of the common data and resource
management substrate.