Unit 3 Topic 8: Flume and Sqoop

The document provides an overview of Apache Sqoop and Apache Flume, two tools used in the Hadoop ecosystem for data transfer. Sqoop is designed for importing and exporting data between relational databases and Hadoop, while Flume is focused on streaming log data into Hadoop. Both tools have distinct architectures, advantages, and limitations, making them suitable for different data ingestion scenarios.

Sqoop & Flume

ABES EC, Ghaziabad


Affiliated to Dr. A.P.J. Abdul Kalam
Technical University, Uttar Pradesh,
Lucknow
Sqoop
 Apache Sqoop (SQL-to-Hadoop) is a lifesaver for
anyone who is experiencing difficulties in moving
data from the data warehouse into the Hadoop
environment.
 Apache Sqoop is an effective Hadoop tool used for
importing data from RDBMSs such as MySQL, Oracle,
etc. into HBase, Hive or HDFS.
 Sqoop Hadoop can also be used for exporting data
from HDFS into an RDBMS.
 Apache Sqoop is a command-line interpreter, i.e.,
the Sqoop commands are executed one at a time.
Conti…
It is used to transfer the bulk of data between
HDFS and Relational Database Servers.
It is used to import the data from RDBMS to
Hadoop and export the data from Hadoop to
RDBMS.
It uses MapReduce for its import and export
operations.
It also uses command-line arguments for its
import and export procedures.
It also supports the Linux Operating System.
Conti…
It is not a cluster service.
The name “SQOOP” came from ‘SQL’ and
‘Hadoop’, means that the first two letters of
Sqoop, i.e. “SQ”, came from SQL, and the last
three letters, i.e. “OOP”, came from Hadoop.
Sqoop always requires “JDBC” and “Connectors”:
a JDBC driver for the database (MySQL, Oracle,
etc.) and a connector such as Oraoop or the
Cloudera connectors.
For the installation of Sqoop, you need a system
with Java and Hadoop already installed.
Sqoop Architecture
Conti…
Multiple mappers perform map tasks to load the data
onto HDFS in parallel, as in the sketch below.
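A minimal sketch of controlling this parallelism (the connection string, table, and directory names are placeholders, not from the slides): the -m/--num-mappers option sets the number of parallel map tasks, and --split-by chooses the column used to partition the work among them.

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table visits \
--split-by id \
-m 4 \
--target-dir /data/visits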
Sqoop Import
Sqoop Export
Steps to Complete the Sqoop Action
Step 1:
 Sqoop sends a request to the RDBMS to return the metadata
information about the table (metadata here is the data about
the data).

Step 2:
 From the received metadata, Sqoop generates the Java
classes (using JDBC and the connectors).

Step 3:
 After compiling, Sqoop packages the generated classes into a
JAR file (the Java packaging standard), which is then used to
work with and verify the data.
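The class-generation and packaging steps can also be run on their own with Sqoop's codegen tool. A minimal sketch, assuming a MySQL database named sqoop and a table named visits (both placeholders):

sqoop codegen \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table visits

Sqoop reports where the generated .java source and the compiled .jar file were written.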
Need of Sqoop
 Apache Sqoop can handle a full load of a table with just a
single command, which we can call Sqoop with full-load
power.
 It also has incremental-load power; you can load just the
rows of a table that have been added or updated.
 It uses the YARN framework to import and export the data,
which provides fault tolerance on top of parallelism.
 You can compress your data by specifying the
compression-codec argument, so Sqoop can also compress
data as it imports it (see the sketch below).
 It is the best intermediary between the RDBMS and
Hadoop.
 It is simple to understand and has an easy-to-use structure.
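A minimal sketch of compressing data on import (connection details and the target directory are placeholders): --compress (or -z) enables compression, and --compression-codec selects a codec such as Snappy.

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table visits \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir /data/visits_compressed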
How Sqoop Works
 The Sqoop import command imports a table from an
RDBMS to HDFS; each record from the RDBMS table
is treated as a separate record in HDFS. Records can
be stored as text files. The same data can also be
written back from HDFS in RDBMS format, and that
process is called exporting a table.

 Sqoop first sends a request to the relational database
to return the metadata information about the table
(metadata here is the data about the table in the
relational DB).
Conti…
A Sqoop job creates and saves the import and
export commands so that they can be re-executed
later and give consistent, accurate results.
It specifies parameters that identify and recall the
saved job.
Re-calling (re-executing) a saved job is used for
incremental import, which imports the newly added
or updated rows from the RDBMS table to HDFS,
and likewise in the other direction from HDFS to the
RDBMS table. A minimal sketch of a saved job is
shown below.
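A minimal sketch of creating, listing, and running a saved incremental-import job (the job name, connection string, and table are placeholders):

sqoop job --create visits-import \
-- import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table visits \
--incremental append \
--check-column id

sqoop job --list
sqoop job --exec visits-import

For a saved incremental job, Sqoop's metastore remembers the last imported value of the check column between executions, so each run picks up only the new rows.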
Sqoop Import
 You have a database table with an INTEGER primary key.
 You are only appending new rows, and you need to
periodically sync the table’s state to Hadoop for further
processing.
 Activate Sqoop's incremental feature by specifying the
--incremental parameter.
 The parameter’s value will be the type of incremental
import. When your table is only getting new rows, and the
existing ones are not changed, use the append mode.
 Sqoop import command imports a table from an
RDBMS to HDFS. Each record from a table is considered as
a separate record in HDFS. Records can be stored as text
files or in binary representation as Avro or SequenceFiles.
Conti…
The Sqoop import tool imports the individual
tables from Relational Databases to Hadoop
Distributed File System. Each row of a table in
RDBMS is treated as a record in the HDFS.
All these records are stored as text data in
text files or as binary data in Avro and
SequenceFiles.
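A minimal sketch of selecting the storage format on import (connection details and the target directory are placeholders): --as-textfile is the default, while --as-sequencefile and --as-avrodatafile select the binary formats.

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table visits \
--as-avrodatafile \
--target-dir /data/visits_avro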
Conti….
General syntax:
sqoop import --connect --table --username --password --target-dir

Example of an incremental (append-mode) import:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table visits \
--incremental append \
--check-column id \
--last-value 1
Conti….
For importing selected data from a table (a concrete sketch is given below):
sqoop import --connect --table --username --password --columns --where

For importing data from a free-form query:
sqoop import --connect --username --password --query
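A hedged sketch of both forms (the database, table, column names, and directories are placeholders; a free-form --query must contain the $CONDITIONS token and needs either --split-by or a single mapper):

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table visits \
--columns "id,page,visit_time" \
--where "visit_time >= '2024-01-01'" \
--target-dir /data/recent_visits

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--query 'SELECT id, page FROM visits WHERE $CONDITIONS' \
-m 1 \
--target-dir /data/visits_query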
Sqoop Export
 This is the general form of the command that we use in sqoop
export, given below:
sqoop export --connect --username --password --table --export-dir

 The Sqoop export tool exports a set of files from the
Hadoop Distributed File System back to the Relational
Database. The files given as input to Sqoop contain the
records.
 These records become rows in a table. They are read and
parsed into a set of records and are delimited with a
user-specified delimiter. A concrete sketch is given below.
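A minimal sketch of an export (connection string, table, and directory are placeholders; the target table must already exist in the database, and the field delimiter must match how the HDFS files were written):

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table visit_stats \
--export-dir /data/visit_stats \
--input-fields-terminated-by ','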
Advantages of Sqoop
Sqoop allows data transfer with the different
structured data stores such as Teradata,
Postgres, Oracle, and so on.
Sqoop executes data transfer in parallel, so
its execution is quick and cost-effective.
Sqoop helps in integration with the sequential
data from the mainframe. This helps reduce
high costs in executing specific jobs using
mainframe hardware.
Limitations of Sqoop
 We cannot pause or resume Sqoop once it is started; it runs as
an atomic step. If it fails, we have to clean up and start it
again.
 The performance of Sqoop export depends on the hardware
configuration (memory, hard disk) of the RDBMS server.
 It is slow because it uses MapReduce in backend processing.
 Failures need special handling in the case of partial export or
import.
 Sqoop 1 uses a JDBC connection for connecting with the RDBMS.
This can be inefficient and perform poorly.
 Sqoop 1 does not provide a Graphical User Interface for easy
use.
Flume
 Apache Flume is a service designed for streaming logs
into the Hadoop environment.
 Apache Flume is an open-source, powerful, reliable
and flexible system used to collect, aggregate and
move large amounts of unstructured data from
multiple data sources into HDFS/HBase in a distributed
fashion via its strong coupling with the Hadoop
cluster.
 Apache Flume is a tool/service/data-ingestion
mechanism for collecting, aggregating and transporting
large amounts of streaming data, such as log data and
events, from various web servers to a centralized
data store.
Flume Architecture
[Diagram: Web Server (data generator) → Flume Agent (Source → Channel → Sink) → HDFS]
Conti…
 In the above diagram, the events generated by an external
source (a web server) are consumed by the Flume data source.
The external source sends events to the Flume source in a
format that is recognized by that source.
 The Flume source receives an event and stores it into one or
more channels. The channel acts as a store which keeps
the event until it is consumed by the Flume sink. This
channel may use a local file system in order to store these
events.
 The Flume sink removes the event from a channel and stores it
into an external repository such as HDFS. There can be
multiple Flume agents, in which case the Flume sink forwards
the event to the Flume source of the next Flume agent in the
flow. A minimal agent configuration sketch is given below.
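A minimal agent configuration sketch, assuming an agent named a1 that reads lines from a netcat source on port 44444, buffers them in a memory channel, and writes them to HDFS (the agent name, host, port, and HDFS path are placeholders):

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The agent is then started with:
flume-ng agent --conf conf --conf-file example.conf --name a1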
Conti…
Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating,
and moving large amounts of log data.
It has a simple and flexible architecture
based on streaming data flows.
It is robust and fault tolerant with tunable
reliability mechanisms and many failover and
recovery mechanisms.
It uses a simple extensible data model that
allows for online analytic application.
Flume Applications
Apache Flume is used by e-commerce companies to
analyze customer behavior from a particular region.
We can use Apache Flume to move huge amounts
of data generated by application servers into the
Hadoop Distributed File System at a higher speed.
Apache Flume is used for fraud detections.
We can use Apache Flume in IoT applications.
Apache Flume can be used for aggregating machine
and sensor-generated data.
We can use Apache Flume in alerting or in SIEM
(Security Information and Event Management) systems.
Important features of FLUME
Flume has a flexible design based upon
streaming data flows.
It is fault tolerant and robust with multiple
failovers and recovery mechanisms.
Flume offers different levels of reliability,
including ‘best-effort delivery’ and
‘end-to-end delivery’. Best-effort delivery
does not tolerate any Flume node failure,
whereas ‘end-to-end delivery’ mode guarantees
delivery even in the event of multiple node
failures.
Conti…
Flume carries data between sources and sinks.
This gathering of data can either be scheduled or
event-driven. Flume has its own query processing
engine which makes it easy to transform each
new batch of data before it is moved to the
intended sink.
Possible Flume sinks include HDFS and HBase.
Flume Hadoop can also be used to transport
event data including but not limited to network
traffic data, data generated by social media
websites and email messages.
Features of Flume
 Apache Flume is a robust, fault-tolerant, and highly available
service.
 It is a distributed system with tunable reliability mechanisms for fail-
over and recovery.
 Apache Flume is horizontally scalable.
 Apache Flume supports complex data flows such as multi-hop flows,
fan-in flows, fan-out flows, contextual routing, etc.
 Apache Flume provides support for large sets of sources, channels,
and sinks.
 Apache Flume can efficiently ingest log data from various servers
into a centralized repository.
 With Flume, we can collect data from different web servers in real-
time as well as in batch mode.
 We can import large volumes of data generated by social networking
sites and e-commerce sites into Hadoop DFS using Apache Flume.
Flume Advantages
Apache Flume enables us to store streaming
data into any of the centralized repositories
(such as HBase, HDFS).
Flume provides a steady data flow between
producer and consumer during read/write
operations.
Flume supports the feature of contextual routing.
Apache Flume guarantees reliable message
delivery.
Flume is reliable, scalable, extensible, fault-
tolerant, manageable, and customizable.
Flume Disadvantages
Apache Flume offers weaker ordering
guarantees.
Apache Flume does not guarantee that the
messages reaching are 100% unique.
Its topology can become complex, and
reconfiguration is challenging.
Apache Flume may suffer from scalability and
reliability issues.
Flume vs Sqoop vs HDFS in Hadoop
Flume: Apache Flume is designed for moving bulkier streaming data into HDFS.
Sqoop: Apache Sqoop is designed for importing data from relational databases into HDFS.
HDFS: HDFS is the distributed file system used by Apache Hadoop for storing data.

Flume: It has an agent-based architecture. In Flume, code called an ‘agent’ takes care of fetching the data.
Sqoop: It has a connector-based architecture. A connector knows how to connect to the data source and how to fetch the data.
HDFS: It has a distributed architecture. The data is distributed across commodity hardware.

Flume: In Flume, the data flows to HDFS via zero or more channels.
Sqoop: HDFS is the destination for data imported using Sqoop.
HDFS: HDFS is the final destination for data storage.
Conti...
Flume: The Apache Flume data load is driven by events.
Sqoop: The Apache Sqoop data load is not event-driven.
HDFS: It just stores the data provided by any means.

Flume: For loading streaming data, such as web server log files or tweets generated on Twitter, we have to use Flume, because Flume agents are designed for fetching streaming data.
Sqoop: For importing data from structured data sources, we have to use Sqoop, because Sqoop connectors know how to interact with structured data sources and how to fetch data from them.
HDFS: HDFS has built-in shell commands for storing data into it. It cannot import streaming data.
THANK
YOU
