Apache Flume Architecture Overview
Uploaded by sonia

APACHE FLUME

Introduction to Apache Flume


• Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates
and transports large amounts of streaming data, such as log files and events
from sources like network traffic, social media and email messages, into
HDFS. Flume is highly reliable and distributed.

• The main idea behind Flume’s design is to capture streaming data
from various web servers into HDFS. Flume has a simple and flexible
architecture based on streaming data flows, and it is fault-tolerant,
providing reliability mechanisms for fault tolerance and failure recovery.
Advantages of Apache Flume
• Flume is scalable, reliable, fault tolerant and customizable for
different sources and sinks.
• Apache Flume can deliver data into centralized stores (i.e. data from
many sources is collected into a single store) like HBase & HDFS.
• Flume is horizontally scalable.
• If the read rate exceeds the write rate, Flume buffers the data and
provides a steady flow between read and write operations.
• Flume provides reliable message delivery. The transactions in Flume
are channel-based where two transactions (one sender & one receiver)
are maintained for each message.
• Using Flume, we can ingest data from multiple servers into Hadoop.

Advantages of Apache Flume
• It gives us a reliable and distributed solution that helps us in
collecting, aggregating and moving large data sets from sources like
Facebook, Twitter and e-commerce websites.
• It helps us to ingest online streaming data from various sources like
network traffic, social media, email messages, log files etc. in HDFS.
• It supports a large set of source and destination types.
Flume Architecture
A Flume agent ingests the streaming data from various data sources into
HDFS. In the diagram, the web server represents the data source; Twitter
is one of the well-known sources of streaming data.
The flume agent has 3 components: source, sink and channel.
1. Source: It accepts the data from the incoming streamline and stores the data in
the channel.
2. Channel: In general, the source reads data faster than the sink can write it.
Thus, we need a buffer to absorb the difference between the read & write
speeds. The buffer acts as an intermediary storage that temporarily holds
the data in transit and therefore prevents data loss. The channel plays this
role: it acts as the local, temporary storage between the source of data and
the persistent data in HDFS.
3. Sink: Then, our last component i.e. Sink, collects the data from the channel and
commits or writes the data in the HDFS permanently.
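The three components above are wired together in an agent's properties file. A minimal sketch, assuming an agent named a1 with a netcat source (any other source type would do) and placeholder host/path values:

```properties
# Hypothetical agent "a1": one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: accepts lines of text on a TCP port and turns them into events
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer between the read and write paths
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: writes events from the channel into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
```

The agent can then be started with the flume-ng launcher, e.g.
flume-ng agent --conf conf --conf-file a1.conf --name a1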
Apache Flume - Architecture
• The following illustration depicts the basic architecture of Flume. As shown in the
illustration, data generators (such as Facebook, Twitter) generate data which gets
collected by individual Flume agents running on them. Thereafter, a data collector
(which is also an agent) collects the data from the agents which is aggregated and
pushed into a centralized store such as HDFS or HBase.
Flume Event
• An event is the basic unit of the data transported inside Flume. It
contains a byte-array payload that is to be transported from the
source to the destination, accompanied by optional headers. A typical
Flume event has the following structure −
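A Flume event can be sketched as optional key/value headers followed by a byte-array body:

```
+-------------------------------------+
| Headers (optional key/value pairs)  |
|   e.g. timestamp, host, ...         |
+-------------------------------------+
| Body (payload as a byte array)      |
+-------------------------------------+
```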
Flume Agent
• An agent is an independent daemon process (JVM) in Flume. It receives
the data (events) from clients or other agents and forwards it to its next
destination (sink or agent). Flume may have more than one agent.
Following diagram represents a Flume agent.
• A Flume agent contains three main components, namely source,
channel, and sink.
Source
• A source is the component of an Agent which receives data from the
data generators and transfers it to one or more channels in the form of
Flume events.
• Apache Flume supports several types of sources and each source
receives events from a specified data generator.
• Example − Avro source, Thrift source, Twitter 1% firehose source, etc.
Channel
• A channel is a transient store which receives the events from the source
and buffers them till they are consumed by sinks. It acts as a bridge
between the sources and the sinks.
• These channels are fully transactional and they can work with any
number of sources and sinks.
• Example − JDBC channel, File system channel, Memory channel, etc.
Sink
• A sink stores the data into centralized stores like HBase and HDFS. It
consumes the data (events) from the channels and delivers it to the
destination. The destination of the sink might be another agent or the
central stores.
• Example − HDFS sink
• Note − A Flume agent can have multiple sources, sinks and channels.
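As an illustration of sink configuration, a sketch of an HDFS sink for a hypothetical agent a1 (path and roll settings are placeholder assumptions):

```properties
# HDFS sink: drains channel c1 and writes events into date-partitioned paths
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
# Roll files every 5 minutes or every 10000 events, whichever comes first
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```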
Additional Components of Flume Agent
• Interceptors
• Channel Selectors
• Default channel selectors
• Multiplexing channel selectors
• Sink Processors
Interceptors
• Interceptors are used to alter or inspect Flume events as they are
transferred between source and channel.
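Interceptors are attached to a source in the agent's properties file. A sketch, assuming agent a1 and source r1, chaining the built-in timestamp and host interceptors:

```properties
# Two interceptors applied in order to every event from source r1
a1.sources.r1.interceptors = i1 i2
# i1 adds a "timestamp" header with the event's arrival time
a1.sources.r1.interceptors.i1.type = timestamp
# i2 adds a "host" header with the agent's hostname or IP
a1.sources.r1.interceptors.i2.type = host
```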
Channel Selectors
These are used to determine which channel should be used to transfer the
data when there are multiple channels. There are two types of channel
selectors −

•Default channel selectors − These are also known as replicating channel
selectors; they replicate every event into each channel.
•Multiplexing channel selectors − These decide which channel to send an
event to based on a value in the header of that event.
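A multiplexing selector is configured on the source. A sketch, assuming agent a1 routes on a hypothetical "region" header into two channels:

```properties
# Source r1 feeds two channels; the selector picks one per event
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
# Route on the value of the "region" header (hypothetical header name)
a1.sources.r1.selector.header = region
a1.sources.r1.selector.mapping.us = c1
a1.sources.r1.selector.mapping.eu = c2
# Events with no matching header value fall back to c1
a1.sources.r1.selector.default = c1
```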
Sink Processors
• These are used to invoke a particular sink from a selected group of
sinks. They can create failover paths for your sinks or load-balance
events across multiple sinks from a channel.
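A failover sink processor is configured through a sink group. A sketch, assuming agent a1 has two sinks k1 and k2 already defined:

```properties
# Group k1 and k2; events go to the highest-priority live sink
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# k1 is preferred; k2 takes over if k1 fails
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Maximum backoff (ms) before retrying a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

Changing processor.type to load_balance would instead spread events across both sinks.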
Apache Flume - Data Flow
• Flume is a framework which is used to move log data into HDFS.
Generally events and log data are generated by the log servers and
these servers have Flume agents running on them. These agents receive
the data from the data generators.
• The data in these agents will be collected by an intermediate node
known as Collector. Just like agents, there can be multiple collectors in
Flume.
• Finally, the data from all these collectors will be aggregated and pushed
to a centralized store such as HBase or HDFS. The following diagram
explains the data flow in Flume.
Multi-hop Flow
• Within Flume, there can be multiple agents and before reaching the
final destination, an event may travel through more than one agent.
This is known as multi-hop flow.
Fan-out Flow
The dataflow from one source to multiple channels is known as fan-out
flow. It is of two types −
•Replicating − The data flow where the data will be replicated in all the
configured channels.
•Multiplexing − The data flow where the data will be sent to a selected
channel which is mentioned in the header of the event.
Failure Handling
• In Flume, for each event, two transactions take place: one at the sender
and one at the receiver. The sender sends events to the receiver. Soon
after receiving the data, the receiver commits its own transaction and
sends a “received” signal to the sender. After receiving the signal, the
sender commits its transaction. (Sender will not commit its transaction
till it receives a signal from the receiver.)
THANK YOU
