Unit 3 Topic 8 Flume and Sqoop
Step 2:
From the retrieved metadata, Sqoop generates the Java classes (using JDBC and connectors).
Step 3:
After compiling, Sqoop creates a jar file (the standard Java packaging format), which is helpful for using the data for our own verification.
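This generation of Java classes and a jar can also be run on its own with Sqoop's codegen tool. A minimal sketch, assuming a hypothetical MySQL database (the same illustrative connection details as the import example later in this topic):

sqoop codegen \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits

Sqoop writes the generated .java, .class, and .jar files into a temporary compile directory and prints their location at the end of the run.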
Need for Sqoop
Apache Sqoop can handle a full load with just a single command, which we can call Sqoop with full-load power (a sketch follows this list).
It also has incremental-load power: you can load just the part of a table that has been updated since the last import.
It uses the YARN framework to import and export the data, which provides fault tolerance on top of parallelism.
You can compress your data by specifying the compression-codec argument; in short, Sqoop can also be used for compression.
It is the best intermediary between an RDBMS and Hadoop.
It is simple to understand and has an easy-to-follow structure.
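A minimal sketch of such a single-command full load with compression enabled, assuming a hypothetical MySQL database at mysql.example.com (all connection details and paths are illustrative):

sqoop import-all-tables \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --warehouse-dir /user/hadoop/full-load \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.GzipCodec

Here import-all-tables pulls every table of the database in one command, while --compress and --compression-codec make Sqoop write compressed output files.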
How Sqoop Works
The Sqoop import command imports a table from an RDBMS into HDFS; each record from the RDBMS table is treated as a separate record in HDFS.
Records can be stored as text files, and the same records can later be read back from HDFS and written out in RDBMS format; that reverse process is called exporting a table.
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --incremental append \
  --check-column id
Conti….
For importing selected data from a table:
sqoop import --connect <jdbc-uri> --table <table> --username <user> --password <password> --columns <column-list> --where <condition>
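For example, a sketch that imports only two columns and the rows matching a condition from the hypothetical visits table used above (the column names and filter are illustrative assumptions):

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --columns "id,page_url" \
  --where "id > 1000" \
  --target-dir /user/hadoop/visits_selected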
The Sqoop export tool exports a set of files from the Hadoop Distributed File System back to a relational database. The files given as input to Sqoop contain records, which become rows in the target table. These files are read and parsed into a set of records delimited with a user-specified delimiter.
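A minimal export sketch, assuming the HDFS directory /user/hadoop/visits_out holds comma-delimited records and a matching empty table already exists in the hypothetical MySQL database (all names are illustrative):

sqoop export \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits_export \
  --export-dir /user/hadoop/visits_out \
  --input-fields-terminated-by ','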
Advantages of Sqoop
Sqoop allows data transfer with different structured data stores such as Teradata, Postgres, Oracle, and so on.
Sqoop executes data transfer in parallel, so its execution is quick and cost-effective (a sketch of tuning this parallelism follows this list).
Sqoop helps in integrating sequential data from mainframes, which helps reduce the high cost of running such jobs on mainframe hardware.
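The degree of parallelism can be tuned explicitly. A sketch that splits the import of the hypothetical visits table across four map tasks using the id column (values are illustrative):

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --split-by id \
  --num-mappers 4 \
  --target-dir /user/hadoop/visits_parallel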
Limitations of Sqoop
We cannot pause or resume Sqoop once it has started; it runs as an atomic step. If it fails, we have to clean up and start it again.
The performance of Sqoop export depends on the hardware configuration (memory, hard disk) of the RDBMS server.
It is slow because it uses MapReduce for backend processing.
Failures need special handling in the case of a partial export or import.
Sqoop 1 uses a JDBC connection to connect to the RDBMS, which can hurt performance and be inefficient.
Sqoop 1 does not provide a graphical user interface for easy use.
Flume
Apache Flume is a service designed for streaming logs into the Hadoop environment.
Apache Flume is an open-source, powerful, reliable, and flexible system used to collect, aggregate, and move large amounts of unstructured data from multiple data sources into HDFS/HBase in a distributed fashion via its strong coupling with the Hadoop cluster.
Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log data and events, from various web servers to a centralized data store.
Flume Architecture
In the Flume architecture, the events generated by an external source (such as a web server) are consumed by the Flume source. The external source sends events to the Flume source in a format that the source recognizes.
The Flume source receives an event and stores it in one or more channels. The channel acts as a store that keeps the event until it is consumed by the Flume sink; the channel may use the local file system to store these events.
The Flume sink removes the event from the channel and stores it in an external repository such as HDFS. There can be multiple Flume agents, in which case the sink forwards the event to the Flume source of the next Flume agent in the flow.
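A minimal single-agent sketch of this source, channel, and sink chain, written as a Flume properties file (the agent name, port, and HDFS path are illustrative assumptions, not a fixed recipe):

# example-agent.conf -- one source, one channel, one sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# source: listens for newline-separated events on a TCP port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# channel: buffers events until the sink consumes them
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# sink: writes events into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1

The agent is then started with:
flume-ng agent --conf ./conf --conf-file example-agent.conf --name agent1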
Conti…
Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating,
and moving large amounts of log data.
It has a simple and flexible architecture
based on streaming data flows.
It is robust and fault tolerant with tunable
reliability mechanisms and many failover and
recovery mechanisms.
It uses a simple, extensible data model that allows for online analytic applications.
Flume Applications
Apache Flume is used by e-commerce companies to
analyze customer behavior from a particular region.
We can use Apache Flume to move huge amounts
of data generated by application servers into the
Hadoop Distributed File System at a higher speed.
Apache Flume is used for fraud detection.
We can use Apache Flume in IoT applications.
Apache Flume can be used for aggregating machine
and sensor-generated data.
We can use Apache Flume for alerting or in SIEM (security information and event management) systems.
Important Features of Flume
Flume has a flexible design based upon
streaming data flows.
It is fault tolerant and robust with multiple
failovers and recovery mechanisms.
Flume offers different levels of reliability, including ‘best-effort delivery’ and ‘end-to-end delivery’. Best-effort delivery does not tolerate any Flume node failure, whereas ‘end-to-end delivery’ mode guarantees delivery even in the event of multiple node failures.
Conti…
Flume carries data between sources and sinks. This gathering of data can be either scheduled or event-driven. Flume has its own query processing engine, which makes it easy to transform each new batch of data before it is moved to the intended sink (a per-event transformation sketch follows this slide).
Possible Flume sinks include HDFS and HBase. Flume can also be used to transport event data, including but not limited to network traffic data, data generated by social media websites, and email messages.
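One concrete way Flume can transform events on the way in is its interceptor mechanism. As a hedged sketch, the lines below, added to the illustrative agent configuration shown earlier, stamp every event with an ingestion timestamp and the originating hostname in its headers before it reaches the channel:

# interceptors run on the source, before events are put on the channel
agent1.sources.src1.interceptors = ts host
agent1.sources.src1.interceptors.ts.type = timestamp
agent1.sources.src1.interceptors.host.type = host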
Features of Flume
Apache Flume is a robust, fault-tolerant, and highly available
service.
It is a distributed system with tunable reliability mechanisms for fail-
over and recovery.
Apache Flume is horizontally scalable.
Apache Flume supports complex data flows such as multi-hop flows, fan-in flows, fan-out flows, and contextual routing (a fan-out sketch follows this list).
Apache Flume provides support for large sets of sources, channels,
and sinks.
Apache Flume can efficiently ingest log data from various servers
into a centralized repository.
With Flume, we can collect data from different web servers in real-
time as well as in batch mode.
We can import large volumes of data generated by social networking
sites and e-commerce sites into Hadoop DFS using Apache Flume.
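A sketch of a fan-out flow, where one source replicates every event into two channels feeding two different sinks (the agent name, log path, and HDFS path are illustrative assumptions):

# fan-out-agent.conf -- one source, two channels, two sinks
agent1.sources  = src1
agent1.channels = ch1 ch2
agent1.sinks    = hdfsSink logSink

# source: tails an application log and replicates events to both channels
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/app.log
agent1.sources.src1.channels = ch1 ch2
agent1.sources.src1.selector.type = replicating

agent1.channels.ch1.type = memory
agent1.channels.ch2.type = memory

# sink 1: long-term storage in HDFS
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/app-logs
agent1.sinks.hdfsSink.channel = ch1

# sink 2: writes events to the agent's log, useful for debugging
agent1.sinks.logSink.type = logger
agent1.sinks.logSink.channel = ch2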
Flume Advantages
Apache Flume enables us to store streaming
data into any of the centralized repositories
(such as HBase, HDFS).
Flume provides a steady data flow between producer and consumer during read/write operations.
Flume supports the feature of contextual routing.
Apache Flume guarantees reliable message
delivery.
Flume is reliable, scalable, extensible, fault-
tolerant, manageable, and customizable.
Flume Disadvantages
Apache Flume offers weaker ordering
guarantees.
Apache Flume does not guarantee that delivered messages are 100% unique; duplicates can occur.
Its topology can become complex, and reconfiguration is challenging.
Apache Flume may suffer from scalability and
reliability issues.
Flume vs Sqoop vs HDFS in Hadoop
Flume: The Apache Flume data load is driven by an event.
Sqoop: The Apache Sqoop data load is not event-driven.
HDFS: HDFS just stores the data provided to it by any means.