Unit 3 Topic 8 Flume and Sqoop
Step 2:
From the retrieved metadata, Sqoop generates the Java classes (using JDBC and connectors).
Step 3:
After compiling, Sqoop creates a jar file (the standard Java packaging format), which is helpful for using the data for our own verification.
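This generation of Java classes and a jar can also be run on its own with Sqoop's codegen tool. A minimal sketch, assuming a hypothetical MySQL database (the same illustrative connection details as the import example later in this topic):

sqoop codegen \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits

Sqoop writes the generated .java, .class, and .jar files into a temporary compile directory and prints their location at the end of the run.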
Need for Sqoop
Apache Sqoop can handle a full load with just a single command, which we can call Sqoop with full-load power (a sketch follows this list).
It also has incremental-load power: you can load just the part of a table that has been updated since the last import.
It uses the YARN framework to import and export the data, which provides fault tolerance on top of parallelism.
You can compress your data by specifying the compression-codec argument; in short, Sqoop can also be used for compression.
It is the best intermediary between an RDBMS and Hadoop.
It is simple to understand and has an easy-to-follow structure.
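A minimal sketch of such a single-command full load with compression enabled, assuming a hypothetical MySQL database at mysql.example.com (all connection details and paths are illustrative):

sqoop import-all-tables \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --warehouse-dir /user/hadoop/full-load \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.GzipCodec

Here import-all-tables pulls every table of the database in one command, while --compress and --compression-codec make Sqoop write compressed output files.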
How Sqoop Works
The Sqoop import command imports a table from an RDBMS into HDFS; each record from the RDBMS table is treated as a separate record in HDFS.
Records can be stored as text files, and the same records can later be read back from HDFS and written out in RDBMS format; that reverse process is called exporting a table.
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --incremental append \
  --check-column id
Conti….
For importing selected data from a table:
sqoop import --connect <jdbc-uri> --table <table> --username <user> --password <password> --columns <column-list> --where <condition>
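For example, a sketch that imports only two columns and the rows matching a condition from the hypothetical visits table used above (the column names and filter are illustrative assumptions):

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --columns "id,page_url" \
  --where "id > 1000" \
  --target-dir /user/hadoop/visits_selected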
The Sqoop export tool exports a set of files from the Hadoop Distributed File System back to a relational database. The files given as input to Sqoop contain records, which become rows in the target table. These files are read and parsed into a set of records delimited with a user-specified delimiter.
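A minimal export sketch, assuming the HDFS directory /user/hadoop/visits_out holds comma-delimited records and a matching empty table already exists in the hypothetical MySQL database (all names are illustrative):

sqoop export \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits_export \
  --export-dir /user/hadoop/visits_out \
  --input-fields-terminated-by ','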
Advantages of Sqoop
Sqoop allows data transfer with different structured data stores such as Teradata, Postgres, Oracle, and so on.
Sqoop executes data transfer in parallel, so its execution is quick and cost-effective (a sketch of tuning this parallelism follows this list).
Sqoop helps in integrating sequential data from mainframes, which helps reduce the high cost of running such jobs on mainframe hardware.
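The degree of parallelism can be tuned explicitly. A sketch that splits the import of the hypothetical visits table across four map tasks using the id column (values are illustrative):

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --split-by id \
  --num-mappers 4 \
  --target-dir /user/hadoop/visits_parallel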
Limitations of Sqoop
We cannot pause or resume Sqoop once it has started; it runs as an atomic step. If it fails, we have to clean up and start it again.
The performance of Sqoop export depends on the hardware configuration (memory, hard disk) of the RDBMS server.
It is slow because it uses MapReduce for backend processing.
Failures need special handling in the case of a partial export or import.
Sqoop 1 uses a JDBC connection to connect to the RDBMS, which can hurt performance and be inefficient.
Sqoop 1 does not provide a graphical user interface for easy use.
Flume
Apache Flume is a service designed for streaming logs into the Hadoop environment.
Apache Flume is an open-source, powerful, reliable, and flexible system used to collect, aggregate, and move large amounts of unstructured data from multiple data sources into HDFS/HBase in a distributed fashion via its strong coupling with the Hadoop cluster.
Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log data and events, from various web servers to a centralized data store.
Flume Architecture
In the Flume architecture, the events generated by an external source (such as a web server) are consumed by the Flume source. The external source sends events to the Flume source in a format that the source recognizes.
The Flume source receives an event and stores it in one or more channels. The channel acts as a store that keeps the event until it is consumed by the Flume sink; the channel may use the local file system to store these events.
The Flume sink removes the event from the channel and stores it in an external repository such as HDFS. There can be multiple Flume agents, in which case the sink forwards the event to the Flume source of the next Flume agent in the flow.
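A minimal single-agent sketch of this source, channel, and sink chain, written as a Flume properties file (the agent name, port, and HDFS path are illustrative assumptions, not a fixed recipe):

# example-agent.conf -- one source, one channel, one sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# source: listens for newline-separated events on a TCP port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# channel: buffers events until the sink consumes them
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# sink: writes events into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1

The agent is then started with:
flume-ng agent --conf ./conf --conf-file example-agent.conf --name agent1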
Conti…
Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating,
and moving large amounts of log data.
It has a simple and flexible architecture
based on streaming data flows.
It is robust and fault tolerant with tunable
reliability mechanisms and many failover and
recovery mechanisms.
It uses a simple, extensible data model that allows for online analytic applications.
Flume Applications
Apache Flume is used by e-commerce companies to
analyze customer behavior from a particular region.
We can use Apache Flume to move huge amounts
of data generated by application servers into the
Hadoop Distributed File System at a higher speed.
Apache Flume is used for fraud detection.
We can use Apache Flume in IoT applications.
Apache Flume can be used for aggregating machine
and sensor-generated data.
We can use Apache Flume for alerting or in SIEM (security information and event management) systems.
Important Features of Flume
Flume has a flexible design based upon
streaming data flows.
It is fault tolerant and robust with multiple
failovers and recovery mechanisms.
Flume offers different levels of reliability, including ‘best-effort delivery’ and ‘end-to-end delivery’. Best-effort delivery does not tolerate any Flume node failure, whereas ‘end-to-end delivery’ mode guarantees delivery even in the event of multiple node failures.
Conti…
Flume carries data between sources and sinks. This gathering of data can be either scheduled or event-driven. Flume has its own query processing engine, which makes it easy to transform each new batch of data before it is moved to the intended sink (a per-event transformation sketch follows this slide).
Possible Flume sinks include HDFS and HBase. Flume can also be used to transport event data, including but not limited to network traffic data, data generated by social media websites, and email messages.
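One concrete way Flume can transform events on the way in is its interceptor mechanism. As a hedged sketch, the lines below, added to the illustrative agent configuration shown earlier, stamp every event with an ingestion timestamp and the originating hostname in its headers before it reaches the channel:

# interceptors run on the source, before events are put on the channel
agent1.sources.src1.interceptors = ts host
agent1.sources.src1.interceptors.ts.type = timestamp
agent1.sources.src1.interceptors.host.type = host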
Features of Flume
Apache Flume is a robust, fault-tolerant, and highly available
service.
It is a distributed system with tunable reliability mechanisms for fail-
over and recovery.
Apache Flume is horizontally scalable.
Apache Flume supports complex data flows such as multi-hop flows, fan-in flows, fan-out flows, and contextual routing (a fan-out sketch follows this list).
Apache Flume provides support for large sets of sources, channels,
and sinks.
Apache Flume can efficiently ingest log data from various servers
into a centralized repository.
With Flume, we can collect data from different web servers in real-
time as well as in batch mode.
We can import large volumes of data generated by social networking
sites and e-commerce sites into Hadoop DFS using Apache Flume.
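A sketch of a fan-out flow, where one source replicates every event into two channels feeding two different sinks (the agent name, log path, and HDFS path are illustrative assumptions):

# fan-out-agent.conf -- one source, two channels, two sinks
agent1.sources  = src1
agent1.channels = ch1 ch2
agent1.sinks    = hdfsSink logSink

# source: tails an application log and replicates events to both channels
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/app.log
agent1.sources.src1.channels = ch1 ch2
agent1.sources.src1.selector.type = replicating

agent1.channels.ch1.type = memory
agent1.channels.ch2.type = memory

# sink 1: long-term storage in HDFS
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/app-logs
agent1.sinks.hdfsSink.channel = ch1

# sink 2: writes events to the agent's log, useful for debugging
agent1.sinks.logSink.type = logger
agent1.sinks.logSink.channel = ch2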
Flume Advantages
Apache Flume enables us to store streaming
data into any of the centralized repositories
(such as HBase, HDFS).
Flume provides a steady data flow between producer and consumer during read/write operations.
Flume supports the feature of contextual routing.
Apache Flume guarantees reliable message
delivery.
Flume is reliable, scalable, extensible, fault-
tolerant, manageable, and customizable.
Flume Disadvantages
Apache Flume offers weaker ordering
guarantees.
Apache Flume does not guarantee that delivered messages are 100% unique; duplicates can occur.
Its topology can become complex, and reconfiguration is challenging.
Apache Flume may suffer from scalability and
reliability issues.
Flume vs Sqoop vs HDFS in Hadoop
Flume: The Apache Flume data load is driven by an event.
Sqoop: The Apache Sqoop data load is not event-driven.
HDFS: HDFS just stores the data provided to it by any means.