15CS82 Module 2
Essential Hadoop Tools
▪ To begin, copy the passwd file to a working directory for local Pig operation:
$ cp /etc/passwd .
▪ Next, copy the data file into HDFS for the Hadoop MapReduce operation:
$ hdfs dfs -put passwd passwd
▪ To confirm the file is in HDFS, enter the following command:
$ hdfs dfs -ls passwd
-rw-r--r-- 2 hdfs hdfs 2526 2015-03-17 11:08 passwd
▪ In local Pig operation, all processing is done on the local machine (Hadoop is not used). First, start the interactive command line:
$ pig -x local
▪ If Pig starts correctly, you will see a grunt> prompt and possibly a number of INFO messages. Next, enter the following commands to load the passwd file, grab the user name, and dump it to the terminal.
▪ Pig commands must end with a semicolon (;).
▪ grunt> A = load 'passwd' using PigStorage(':');
▪ grunt> B = foreach A generate $0 as id;
▪ grunt> dump B;
▪ The processing will start and a list of user names will be printed to the screen.
▪ To exit the interactive session, enter the command quit.
grunt> quit
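▪ The same Pig Latin commands can also be placed in a script file and run non-interactively. The sketch below makes one assumption: the script name id.pig is illustrative and not part of the original example.
-- id.pig: extract the user name field from passwd
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
▪ The script can be run locally with $ pig -x local id.pig, or against the copy of passwd already placed in HDFS with $ pig -x mapreduce id.pig.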
Apache Sqoop
▪ Sqoop is a tool designed to transfer data between Hadoop and relational databases.
▪ Sqoop is used to:
- import data from a relational database management system (RDBMS) into the Hadoop Distributed File System (HDFS),
- transform the data in Hadoop, and
- export the data back into an RDBMS.
▪ Sqoop imports data in two steps:
1) Sqoop examines the database to gather the necessary metadata for the data to be imported.
2) A map-only Hadoop job transfers the actual data using the metadata.
▪ The imported data are saved in an HDFS directory. By default, Sqoop uses the name of the imported table for the directory, or the user can specify any alternative directory where the files should be populated. By default, these files contain comma-delimited fields, with new lines separating different records.
▪ The Sqoop example proceeds in the following steps:
1. Download Sqoop.
2. Load the sample MySQL database.
3. Add Sqoop user permissions for the local machine and cluster.
4. Import data from MySQL to HDFS.
5. Export data from HDFS to MySQL.
Step 1: Download Sqoop and Load Sample MySQL Database
To install Sqoop:
# yum install sqoop sqoop-metastore
To download the sample World database:
$ wget http://downloads.mysql.com/docs/world_innodb.sql.gz
Step 2: Add Sqoop User Permissions for the Local Machine and Cluster
In MySQL, add the following privileges for user sqoop:
mysql> GRANT ALL PRIVILEGES ON world.* TO 'sqoop'@'limulus' IDENTIFIED BY 'sqoop';
mysql> GRANT ALL PRIVILEGES ON world.* TO 'sqoop'@'10.0.0.%' IDENTIFIED BY 'sqoop';
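▪ With the permissions in place, a table can be imported into HDFS (steps 4 and 5 above). The following is a minimal sketch: the host name limulus comes from the GRANT statements, while the Country table (part of the sample World database) and the target directory are chosen here only for illustration.
$ sqoop import --connect jdbc:mysql://limulus/world \
  --username sqoop --password sqoop \
  --table Country -m 1 --target-dir /user/sqoop/country
▪ The reverse direction (HDFS to MySQL) uses sqoop export with an --export-dir option pointing at the HDFS directory and a --table option naming an existing MySQL table.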
Apache Flume
▪ Data transport involves a number of Flume agents that may traverse a series of machines and locations.
▪ Flume is often used for log files, social media-generated data, email messages, and just
about any continuous data source.
▪ A Flume agent is built from three components: a source, a channel, and a sink. A Flume agent must have all three of these components defined, and an agent can have several sources, channels, and sinks.
▪ A source can write to multiple channels, but a sink can take data from only a single channel.
▪ Data written to a channel remain in the channel until a sink removes the data.
▪ By default, the data in a channel are kept in memory but may be optionally stored on
disk to prevent data loss in the event of a network failure.
▪ Flume agents may be placed in a pipeline, possibly to traverse several machines or domains.
▪ In a Flume pipeline, the sink from one agent is connected to the source of another.
▪ The data transfer format normally used by Flume is called Apache Avro.
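▪ A minimal agent configuration illustrates the source-channel-sink wiring described above. Everything in this sketch is illustrative: the agent name a1, the netcat source listening on port 44444, the in-memory channel, and the console logger sink are assumptions, not part of the original example.
# example.conf: a single-agent Flume configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source: read text lines arriving on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# channel: buffer events in memory
a1.channels.c1.type = memory
# sink: write events to the console log
a1.sinks.k1.type = logger
# a source may feed several channels, but each sink reads from exactly one channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
▪ The agent would then be started with something like $ flume-ng agent -n a1 -c conf -f example.conf.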
Apache Hive
▪ To start Hive, simply enter the hive command. If Hive starts correctly, you get a hive> prompt.
$ hive
(some messages may show up here)
hive>
▪ The following Hive commands create and then drop a table. Note that Hive commands must end with a semicolon (;).
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 1.705 seconds
▪ To see that the table has been created:
hive> SHOW TABLES;
OK
pokes
Time taken: 0.174 seconds, Fetched: 1 row(s)
▪ To drop the table,
hive> DROP TABLE pokes;
OK
Time taken: 4.038 seconds
▪ A more detailed example creates a table to hold data from a web server log file:
hive> CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
▪ Next, load the data from the sample.log file. Note that the file is found in the local directory and not in HDFS.
hive> LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO
TABLE logs;
▪ Finally, apply the select step. Note that this invokes a Hadoop MapReduce operation. The results appear at the end of the output.
hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs GROUP BY t4;
Distributed-Shell
▪ Distributed-Shell is an example application included with the Hadoop core components that
demonstrates how to write applications on top of YARN.
▪ It provides a simple method for running shell commands and scripts in containers in parallel
on a Hadoop YARN cluster.
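▪ For example, a shell command such as uptime can be run in two containers with an invocation along the following lines; the exact jar path varies by installation and is only an assumption here.
$ yarn org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -shell_command uptime -num_containers 2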
Hadoop MapReduce
▪ MapReduce was the first YARN framework and drove many of YARN’s requirements. It
is integrated tightly with the rest of the Hadoop ecosystem projects, such as Apache Pig,
Apache Hive, and Apache Oozie.
Apache Tez:
▪ Many Hadoop jobs involve the execution of a complex directed acyclic graph (DAG) of tasks using separate MapReduce stages. Apache Tez generalizes this process and enables these tasks to be spread across stages so that they can be run as a single, all-encompassing job.
▪ Tez can be used as a MapReduce replacement for projects such as Apache Hive and Apache
Pig. No changes are needed to the Hive or Pig applications.
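▪ For example, on a Hive installation where Tez support has been installed and configured (an assumption here), the earlier log-table query can be switched to the Tez engine with a single session setting:
hive> SET hive.execution.engine=tez;
hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs GROUP BY t4;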
Apache Giraph
▪ Apache Giraph is an iterative graph processing system built for high scalability.
▪ In addition, using the flexibility of YARN, the Giraph developers plan on implementing
their own web interface to monitor job progress.
Hoya: HBase on YARN
▪ A client application creates the persistent configuration files, sets up the HBase cluster XML files, and then asks YARN to create an ApplicationMaster.
▪ YARN copies all files listed in the client’s application-launch request from HDFS into the
local file system of the chosen server, and then executes the command to start the Hoya
ApplicationMaster.
▪ Hoya also asks YARN for the number of containers matching the number of HBase region
servers it needs.
Dryad on YARN
▪ Similar to Apache Tez, Microsoft’s Dryad provides a DAG as the abstraction of execution
flow. This framework is ported to run natively on YARN and is fully compatible with its
non-YARN version.
▪ The code is written completely in native C++ and C# for worker nodes and uses a thin layer
of Java within the application.
Apache Spark
▪ Spark was initially developed for applications in which keeping data in memory improves
performance, such as iterative algorithms, which are common in machine learning, and
interactive data mining.
▪ Spark differs from classic MapReduce in two important ways.
▪ First, Spark holds intermediate results in memory, rather than writing them to disk.
▪ Second, Spark supports more than just MapReduce functions; that is, it greatly expands
the set of possible analyses that can be executed over HDFS data stores.
Apache Storm
▪ This framework is designed to process unbounded streams of data in real time. It can be
used in any programming language.
▪ The basic Storm use-cases include real-time analytics, online machine learning, continuous
computation, distributed RPC (remote procedure calls), ETL (extract, transform, and load),
and more.
▪ Storm provides fast performance, is scalable, is fault tolerant, and provides processing
guarantees.
▪ It works directly under YARN and takes advantage of the common data and resource
management substrate.