Hadoop Ecosystem
CSE412 Big Data and Large Scale Computing
Mr P Balasubramanian
Hadoop Ecosystem …
Cont.
• HBase is a column-oriented NoSQL database.
• It runs on top of HDFS and can handle any type of data.
• It allows real-time processing and random read/write operations on the data.
• Pig was developed for analysing large datasets; it removes the difficulty of writing map and reduce functions by hand.
• It consists of two components: Pig Latin and the Pig Engine.
• Pig Latin is a scripting language similar to SQL.
• The Pig Engine is the execution engine on which Pig Latin scripts run.
• Hive is a distributed data warehouse system originally developed at Facebook.
• It allows easy reading, writing, and managing of data stored on HDFS.
• It has its own query language for this purpose, the Hive Query Language (HQL).
• Programmers write simple HQL queries, which Hive translates into MR jobs, instead of writing MR functions by hand.
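To illustrate the kind of query that replaces hand-written map and reduce functions, the sketch below runs a SQL aggregation. It uses Python's built-in sqlite3 module purely as a stand-in for Hive (an assumption, so the example runs without a cluster), and the table `page_views` and its columns are hypothetical. HQL for the same query would look similar but is not guaranteed to be identical.

```python
import sqlite3

# sqlite3 stands in for Hive here (assumption): no cluster or HDFS is involved.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "/home"), ("u2", "/home"), ("u1", "/cart"), ("u3", "/home")],
)

# One declarative statement takes the place of a hand-written
# map (emit (url, 1)) and reduce (sum the 1s per url).
query = """
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
"""
views_per_url = conn.execute(query).fetchall()
print(views_per_url)
conn.close()
```

The point of the sketch is the shape of the work: the programmer states what to compute, and the engine decides how to run it.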
Hadoop Ecosystem …
Cont.
• Spark is written in Scala but supports applications written in Java, Python, etc.
• In-memory processing makes it faster and suited to real-time processing.
• Spark Core is the main execution engine for Spark.
• Spark SQL allows querying structured data stored in DataFrames or Hive tables.
• Spark Streaming enables Spark to handle real-time data.
• MLlib is a scalable machine learning library for data science tasks.
• GraphX is a graph computation engine that enables users to interactively build, transform, and reason about graph-structured data at scale.
• Kafka is distributed and has built-in partitioning, replication, and fault tolerance.
• It handles streaming data and allows businesses to analyse data in real time.
• Kafka suits real-time data streaming use cases such as clickstream analysis, fraud detection, and real-time analytics.
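The partitioning and replication mentioned above can be pictured with a small in-memory model. This is a sketch under stated assumptions, not Kafka's implementation or client API: the broker names, partition count, and replica placement rule are hypothetical, and CRC32 is used as the key hash only because it is in the standard library (Kafka's default partitioner uses a different hash).

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_PARTITIONS = 3        # hypothetical topic layout
REPLICATION_FACTOR = 2    # each partition is kept on this many brokers
BROKERS = ["broker-0", "broker-1", "broker-2"]  # hypothetical broker names


def partition_for(key: str) -> int:
    """Pick a partition from a stable hash of the record key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def replicas_for(partition: int) -> List[str]:
    """Place a partition on REPLICATION_FACTOR consecutive brokers."""
    return [BROKERS[(partition + i) % len(BROKERS)]
            for i in range(REPLICATION_FACTOR)]


# Each partition is an append-only log; a record's offset is its index in it.
logs: Dict[int, List[Tuple[str, str]]] = defaultdict(list)


def send(key: str, value: str) -> Tuple[int, int]:
    """Append a record and return (partition, offset)."""
    partition = partition_for(key)
    logs[partition].append((key, value))
    return partition, len(logs[partition]) - 1


clicks = [("user-1", "/home"), ("user-2", "/home"), ("user-1", "/cart")]
placements = [send(user, url) for user, url in clicks]
print(placements)
print({p: replicas_for(p) for p in range(NUM_PARTITIONS)})
```

Because the partition is derived from the key, all clicks from one user land in the same log in order, while the replica list shows which brokers could still serve that partition if one of them failed.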
BML 521 Big Data Management and Processing
Mr P Balasubramanian
Apache Pig
• An abstraction over MR.
• A tool/platform used to analyse large sets of data.
• Used with Hadoop, Pig can perform all data manipulation operations.
• Pig provides a high-level language known as Pig Latin.
• Pig Latin provides various operators, with which programmers can develop their own functions for reading, writing, and processing data.
Apache PIG History
• In 2006, Apache Pig was developed as a research project at Yahoo to create and execute MR jobs on large datasets.
• In 2007, Apache Pig was open sourced via the Apache Incubator.
• In 2008, the first release of Apache Pig
came out.
• In 2010, Apache Pig graduated as an
Apache top-level project.
Convert to MR
• All Pig Latin scripts are internally converted to MR tasks.
• Apache Pig has a component known as the Pig Engine, which accepts Pig Latin scripts as input and converts them into MapReduce jobs.
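As a rough picture of what such a conversion produces, the sketch below spells out the map, shuffle, and reduce phases behind a grouping and a sum. The Pig Latin script in the comment is a hypothetical example, not one from the slides, and the real Pig Engine builds and optimizes its plans in ways this sketch does not attempt to show.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical Pig Latin script being illustrated (not taken from the slides):
#   logs    = LOAD 'logs' AS (url:chararray, bytes:long);
#   grouped = GROUP logs BY url;
#   totals  = FOREACH grouped GENERATE group, SUM(logs.bytes);

Record = Tuple[str, int]


def map_phase(records: Iterable[Record]) -> List[Record]:
    """Map: emit (grouping key, value) for each input record."""
    return [(url, size) for url, size in records]


def shuffle_phase(pairs: Iterable[Record]) -> Dict[str, List[int]]:
    """Shuffle: collect all values that share a key (the GROUP step)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: aggregate each key's values (the SUM step)."""
    return {key: sum(values) for key, values in grouped.items()}


log_records = [("/home", 120), ("/cart", 300), ("/home", 80)]
totals = reduce_phase(shuffle_phase(map_phase(log_records)))
print(totals)
```

Three short script statements correspond to three hand-written phases here, which is the saving the slides attribute to Pig.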
Easy !!!
• Programmers can perform MR tasks easily, without writing complex code in Java.
• Multi-query approach: reduces the length of the code.
• An operation that requires 200 LoC in Java can be done with as few as 10 LoC in Apache Pig.
• Pig reduces development time by almost 16 times.
Pig Latin Code - Wordcount
8 lines of code in Pig Latin
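The following is not the slide's Pig Latin listing. It is a Python sketch of the dataflow a word count typically follows, with comments naming the Pig Latin operator that usually plays each role; the function name `word_count` and the sample input are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def word_count(lines: List[str]) -> Dict[str, int]:
    # LOAD: `lines` plays the role of the input relation, one line per record.
    # TOKENIZE + FLATTEN: split each line into words, one word per record.
    words = [word for line in lines for word in line.split()]
    # GROUP + COUNT: count how many records share each word.
    return dict(Counter(words))


counts = word_count(["big data big ideas", "data pipelines"])
print(counts)
```

Each comment corresponds to roughly one statement of a Pig Latin script, which is why the whole job fits in a handful of lines.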
PIG VS MR
Applications of PIG
• To process huge data sources such as
web logs.
• To perform data processing for
search platforms.
• To process time-sensitive data loads.
Features of PIG
• Rich set of operators: Pig provides operators for operations like join, sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL.
• Optimization opportunities: tasks in Pig optimize their execution automatically, so programmers can focus on the semantics of the language.
• Extensibility: using the existing operators, users can develop their own functions to read, process, and write data.
Features of PIG
• UDFs: Pig provides the facility to create user-defined functions in other programming languages and to invoke or embed them in Pig scripts.
• Handles all kinds of data, both structured and unstructured.
• It stores the results in HDFS.
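The operator and UDF features listed above can be pictured with plain Python over small in-memory relations. This is only an analogy, not Pig's API: the relations, the helper names (`filter_rows`, `join_on_first`, `order_by`), and the UDF `to_upper` are all hypothetical, and registering a real UDF with Pig is not shown.

```python
from typing import Callable, List, Tuple

# Hypothetical relations: (user_id, name) and (user_id, amount).
users = [(1, "asha"), (2, "ravi"), (3, "meena")]
orders = [(1, 250.0), (2, 40.0), (1, 75.0), (3, 500.0)]


def to_upper(name: str) -> str:
    """A user-defined function: custom logic applied to one field."""
    return name.upper()


def filter_rows(rows: List[Tuple], predicate: Callable[[Tuple], bool]) -> List[Tuple]:
    """Analogue of a filter operator: keep rows matching the predicate."""
    return [row for row in rows if predicate(row)]


def join_on_first(left: List[Tuple], right: List[Tuple]) -> List[Tuple]:
    """Analogue of a join operator: match rows on their first field."""
    return [l + r[1:] for l in left for r in right if l[0] == r[0]]


def order_by(rows: List[Tuple], key: Callable[[Tuple], float],
             descending: bool = False) -> List[Tuple]:
    """Analogue of a sort operator."""
    return sorted(rows, key=key, reverse=descending)


large = filter_rows(orders, lambda order: order[1] >= 75.0)
joined = join_on_first(users, large)
ranked = order_by(joined, key=lambda row: row[2], descending=True)
report = [(to_upper(name), amount) for _, name, amount in ranked]
print(report)
```

The built-in style operators handle the relational steps, and the user-defined function slots in where custom per-field logic is needed.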
PIG Architecture and Components
Comparison
Thank You