Apache Flume Architecture Overview
Uploaded by sonia

APACHE FLUME

Introduction to Apache Flume


• Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates
and transports large amounts of streaming data, such as log files and events
from sources like network traffic, social media and email messages, into
HDFS. Flume is highly reliable and distributed.

• The main idea behind Flume’s design is to capture streaming data
from various web servers into HDFS. Flume has a simple and flexible
architecture based on streaming data flows, and it is fault-tolerant,
providing reliability mechanisms for fault tolerance and failure recovery.
Advantages of Apache Flume
• Flume is scalable, reliable, fault tolerant and customizable for
different sources and sinks.
• Apache Flume can deliver data into centralized stores (i.e. data from
many sources is collected into a single store) like HBase & HDFS.
• Flume is horizontally scalable.
• If the read rate exceeds the write rate, Flume buffers the data and
provides a steady flow between read and write operations.
• Flume provides reliable message delivery. The transactions in Flume
are channel-based where two transactions (one sender & one receiver)
are maintained for each message.
• Using Flume, we can ingest data from multiple servers into Hadoop.

Advantages of Apache Flume
• It gives us a reliable and distributed solution that helps us in
collecting, aggregating and moving large data sets from sources like
Facebook, Twitter and e-commerce websites.
• It helps us to ingest online streaming data from various sources like
network traffic, social media, email messages, log files etc. in HDFS.
• It supports a large set of source and destination types.
Flume Architecture
A Flume agent ingests the streaming data from various data sources into
HDFS. In the diagram, the web server represents the data source; Twitter
is one of the well-known sources of streaming data.
The flume agent has 3 components: source, sink and channel.
1. Source: It accepts the data from the incoming streamline and stores the data in
the channel.
2. Channel: In general, the source reads data faster than the sink can write it.
Thus, we need a buffer to absorb the difference between the read & write
speeds. The buffer acts as an intermediary storage that temporarily holds
the data in transit and therefore prevents data loss. The channel plays this
role: it acts as the local, temporary storage between the source of data and
the persistent data in HDFS.
3. Sink: Then, our last component i.e. Sink, collects the data from the channel and
commits or writes the data in the HDFS permanently.
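The three components above are wired together in an agent's properties file. A minimal sketch, assuming an agent named a1 with a netcat source (any other source type would do) and placeholder host/path values:

```properties
# Hypothetical agent "a1": one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: accepts lines of text on a TCP port and turns them into events
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer between the read and write paths
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: writes events from the channel into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
```

The agent can then be started with the flume-ng launcher, e.g.
flume-ng agent --conf conf --conf-file a1.conf --name a1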
Apache Flume - Architecture
• The following illustration depicts the basic architecture of Flume. As shown in the
illustration, data generators (such as Facebook, Twitter) generate data which gets
collected by individual Flume agents running on them. Thereafter, a data collector
(which is also an agent) collects the data from the agents which is aggregated and
pushed into a centralized store such as HDFS or HBase.
Flume Event
• An event is the basic unit of the data transported inside Flume. It
contains a byte-array payload that is to be transported from the
source to the destination, accompanied by optional headers. A typical
Flume event has the following structure −
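A Flume event can be sketched as optional key/value headers followed by a byte-array body:

```
+-------------------------------------+
| Headers (optional key/value pairs)  |
|   e.g. timestamp, host, ...         |
+-------------------------------------+
| Body (payload as a byte array)      |
+-------------------------------------+
```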
Flume Agent
• An agent is an independent daemon process (JVM) in Flume. It receives
the data (events) from clients or other agents and forwards it to its next
destination (sink or agent). Flume may have more than one agent.
Following diagram represents a Flume agent.
• A Flume agent contains three main components, namely source,
channel, and sink.
Source
• A source is the component of an Agent which receives data from the
data generators and transfers it to one or more channels in the form of
Flume events.
• Apache Flume supports several types of sources and each source
receives events from a specified data generator.
• Example − Avro source, Thrift source, Twitter 1% firehose source, etc.
Channel
• A channel is a transient store which receives the events from the source
and buffers them till they are consumed by sinks. It acts as a bridge
between the sources and the sinks.
• These channels are fully transactional and they can work with any
number of sources and sinks.
• Example − JDBC channel, File system channel, Memory channel, etc.
Sink
• A sink stores the data into centralized stores like HBase and HDFS. It
consumes the data (events) from the channels and delivers it to the
destination. The destination of the sink might be another agent or the
central stores.
• Example − HDFS sink
• Note − A Flume agent can have multiple sources, sinks and channels.
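As an illustration of sink configuration, a sketch of an HDFS sink for a hypothetical agent a1 (path and roll settings are placeholder assumptions):

```properties
# HDFS sink: drains channel c1 and writes events into date-partitioned paths
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
# Roll files every 5 minutes or every 10000 events, whichever comes first
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```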
Additional Components of Flume Agent
• Interceptors
• Channel Selectors
• Default channel selectors
• Multiplexing channel selectors
• Sink Processors
Interceptors
• Interceptors are used to alter or inspect Flume events as they are
transferred between source and channel.
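Interceptors are attached to a source in the agent's properties file. A sketch, assuming agent a1 and source r1, chaining the built-in timestamp and host interceptors:

```properties
# Two interceptors applied in order to every event from source r1
a1.sources.r1.interceptors = i1 i2
# i1 adds a "timestamp" header with the event's arrival time
a1.sources.r1.interceptors.i1.type = timestamp
# i2 adds a "host" header with the agent's hostname or IP
a1.sources.r1.interceptors.i2.type = host
```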
Channel Selectors
These are used to determine which channel should be used to transfer the
data when there are multiple channels. There are two types of channel
selectors −

•Default channel selectors − These are also known as replicating channel
selectors; they replicate every event into each channel.
•Multiplexing channel selectors − These decide which channel to send an
event to based on a value in the header of that event.
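A multiplexing selector is configured on the source. A sketch, assuming agent a1 routes on a hypothetical "region" header into two channels:

```properties
# Source r1 feeds two channels; the selector picks one per event
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
# Route on the value of the "region" header (hypothetical header name)
a1.sources.r1.selector.header = region
a1.sources.r1.selector.mapping.us = c1
a1.sources.r1.selector.mapping.eu = c2
# Events with no matching header value fall back to c1
a1.sources.r1.selector.default = c1
```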
Sink Processors
• These are used to invoke a particular sink from a selected group of
sinks. They can create failover paths for your sinks or load-balance
events across multiple sinks from a channel.
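A failover sink processor is configured through a sink group. A sketch, assuming agent a1 has two sinks k1 and k2 already defined:

```properties
# Group k1 and k2; events go to the highest-priority live sink
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# k1 is preferred; k2 takes over if k1 fails
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Maximum backoff (ms) before retrying a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

Changing processor.type to load_balance would instead spread events across both sinks.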
Apache Flume - Data Flow
• Flume is a framework which is used to move log data into HDFS.
Generally events and log data are generated by the log servers and
these servers have Flume agents running on them. These agents receive
the data from the data generators.
• The data in these agents will be collected by an intermediate node
known as Collector. Just like agents, there can be multiple collectors in
Flume.
• Finally, the data from all these collectors will be aggregated and pushed
to a centralized store such as HBase or HDFS. The following diagram
explains the data flow in Flume.
Multi-hop Flow
• Within Flume, there can be multiple agents and before reaching the
final destination, an event may travel through more than one agent.
This is known as multi-hop flow.
Fan-out Flow
The dataflow from one source to multiple channels is known as fan-out
flow. It is of two types −
•Replicating − The data flow where the data will be replicated in all the
configured channels.
•Multiplexing − The data flow where the data will be sent to a selected
channel which is mentioned in the header of the event.
Failure Handling
• In Flume, for each event, two transactions take place: one at the sender
and one at the receiver. The sender sends events to the receiver. Soon
after receiving the data, the receiver commits its own transaction and
sends a “received” signal to the sender. After receiving the signal, the
sender commits its transaction. (Sender will not commit its transaction
till it receives a signal from the receiver.)
THANK YOU
